VP, SRE Operations & Problem Management Senior Analyst, Technology and Operations

Full Time, onsite
DBS Bank
Singapore, Singapore

Salary undisclosed

Business Function

Group Technology and Operations (T&O) enables and empowers the bank with an efficient, nimble and resilient infrastructure through a strategic focus on productivity, quality & control, technology, people capability and innovation. In Group T&O, we manage the majority of the Bank’s operational processes and aspire to delight our business partners through our multiple banking delivery channels.

Job Purpose

This position is for an SRE Operations & Problem Management Senior Analyst within the enabling group, Enterprise Architecture & Site Reliability Engineering (EASRE) department. The role is expected to run incident retrospective operations effectively and to take part in other SRE activities in general. These pertain to maintenance management, including availability, latency, performance, change management, monitoring and capacity planning, as well as the solutions derived from emergency response.

Key Accountabilities

Facilitate & conduct incident retrospective (RCA) activities effectively from end to end
Absorb new technology rapidly & apply effectively
Evaluate & demonstrate new cloud technologies as required
Communicate well with technical & non-technical colleagues
Mentor other colleagues as necessary
Work to a high standard with agreed timescales
Undertake any other tasks or duties that are reasonable & requested by your supervisor or a member of the senior management team
Review code
Apply knowledge in supporting "Run" operations
Perform data analysis & provide suggestions for identifying Service Level Indicators & Service Level Objectives

Responsibilities

Responsible for effectively facilitating the Problem Management Process
Demonstrate authority in RCA calls while coordinating with other stakeholders & resolve discrepancies in a blameless way
Responsible for efficient allocation of time & resources given parallel major incidents and problem activities
Point of contact for assigned incidents of higher severity (from incident retrospective calls all the way up to Management Report (MR) documentation and publishing)
Manage the updates of systems such as the problem management module, internal SharePoint, etc.
Propose & participate in enhancement activities related to SRE
Collaborate with Engineering Teams within EASRE and with LOBs on enabling activities as part of preventive measures
Develop event management process, metrics, and governance model
Perform trend analysis on events to identify potential issues/incidents
Consolidate and analyze noise alerts to zoom in on actual issues
Leverage GenAI and real-time data feeds to produce post-incident reports
Implement automatic root cause identification, reducing turnaround for RCA Reports
Coordinate & automate incident thematic and trend analysis using AI/ML
Identify event/incident clustering for improvements

Requirements

Skills & Experience:

In-depth understanding of Public/Private/Hybrid cloud solutions
Hands-on experience with popular CI/CD tools like Jenkins, Nexus, SonarQube, Bitbucket, etc.
Good exposure to logging & monitoring tools like Dynatrace, Prometheus, Grafana, ELF/ELK
Good understanding of cloud native technologies like Containers, Kubernetes etc
Develop & enhance production monitoring & management capabilities leveraging existing platforms & tools
Minimum 10 years of root cause analysis (RCA) exposure & involvement leading discussions as a problem manager or incident commander
In-depth understanding of Incident & Problem Management functions & activities
Good understanding of Identity and access management
Software incident & problem management
Work with stakeholders & command centre in troubleshooting, escalating & solutioning critical site incidents
Identify recurring system/application issues & work with cloud team, infra teams, product development, vendors & other stakeholders in investigating & resolving the cause
Maintain accurate documentation of incidents including impact details, timelines, steps taken for mitigation/resolution
Strong verbal & written communication skills, particularly effective documentation skills
Prior experience in developing and implementing event management processes and governance models
Strong analytical skills with the ability to interpret complex data sets
Proficiency in event management tools and platforms
Familiarity with ITIL (Information Technology Infrastructure Library) practices related to Incident Management, Problem Management, Change Management and Event Management
Experience with AI/ML technologies and their application in incident analysis

Desirable

Minimum 6 years of software development, technical support or operations experience
Experience with Jira, Confluence
Basic knowledge of Linux/Windows
Exposure to Enterprise databases, e.g. Oracle, SQL Server, MariaDB, MongoDB & Sybase
Knowledge in systems & multi-tier application & network troubleshooting
Experience with load balancing principles
Essential knowledge & awareness of Public/Private/Hybrid cloud solutions
Good exposure to logging & monitoring tools like Dynatrace, Prometheus, Grafana, ELG/ELK
ITIL V4 certification preferred
Trend Analysis and Forecasting
Process Development and Governance
Familiarity with GenAI (Generative AI) or similar technologies
Continuous Improvement Mindset

Apply Now

We offer a competitive salary and benefits package and the professional advantages of a dynamic environment that supports your development and recognizes your achievements.

Primary Location

Singapore-DBS Asia Hub

Job

Technology

Job Posting

Aug 20, 2024, 11:08:18 AM
