IAM & BO1
Issue 3 - Additional time spent on solving incidents due to limited documentation
Issue short name: Rework
Issue description:
- required knowledge about new changes is limited
- Partly recognized by PM and Ops. Ops is involved on an ad hoc basis, and only part of the team is a reliable source for Ops. Source (Magda): knowledge management is in place
- Source (Daniele Manzotti): about 30% of the incidents are related to incorrect data in the migration template. Open question: is this 30% of total incidents or 30% of external incidents?
- Source (…) additional time required to solve incidents,
- Source (Salvatore) First line fix rate is 40% (60% consists of reporting requests, configuration requests, root cause for job abend)
- Note (MDL) - 1st line is BO1 I presume? yes
CI Next steps
- Source 3/8 (OBA: Tommasso): the Installation Instruction Template facilitates the handover from test to deployment, to prevent late involvement and lack of information transfer.
- Source (CI): DevOps team structure
- Source (CI): Confluence
- covered with operational handbook, no followup required DMB 05/09
- Check how much time is spent on solving incidents + increase due to missing information.
- Wouter & Gert: Execute meeting with scrum master and release coordinator
- Wouter: Follow up with project manager: the project manager recognizes that documentation is not always up to date or is missing (test plan & implementation plan)
- Wouter: Validate that remaining 60% of incidents are fixed by Development (Alessandra Rossanjo)
- Gert: Investigate confluence solution
Issue 2 - Contaminated QA environment leading to preventable incidents
The UAT environment is contaminated with versions that differ from the production environment. Tests performed in the UAT environment are therefore not representative of the behavior of the software in prod, which leads to incidents that could have been prevented.
Root cause:
BU controls all environments up to and including UAT. Releases are done every week and are not in all instances synchronized with production, meaning UAT is not production-like. Source (PM: Gian Mario): specification of differences in environment: key differences, slow test environment, not fully automated, quality of testing, data is different.
Validate reason of not being in sync
Impact:
- time-spent by OPS on solving preventable incidents in production, due to testing on a non-production like environment.
- Average time spent on a known incident is 1 hour; time spent on an unknown incident is 2 hours. 60-70% known issues (343 incidents). 30% new issues (137 incidents)
New environment could solve about 20% of the incidents: Pro rata 15% (=73 incidents) of known issues and 5% of new issues (=22 incidents)
--> (73 x 1 hour) + (22 x 2 hours) = 117 hours = 0.07FTE
- 1 emergency change every 2 days (220 working days in 1 year = 110 emergency changes). Average time spent on an emergency change is 1-2 hours: 110 x 1.5 hours = 165 hours = 0.1FTE
--> 0.07+0.1 = 0.17FTE
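The estimate above can be reproduced with a short calculation. This is a minimal sketch: the function name and the definition of one FTE as 1,650 hours/year (220 working days x 7.5 hours/day, i.e. a 37.5-hour week) are assumptions, chosen because they reproduce the rounded figures in these notes.

```python
# Assumption: 1 FTE = 220 working days x 7.5 hours/day = 1650 hours/year.
HOURS_PER_FTE_YEAR = 220 * 7.5


def hours_to_fte(hours_per_year: float) -> float:
    """Convert a yearly number of hours into FTE (hypothetical helper)."""
    return hours_per_year / HOURS_PER_FTE_YEAR


# Incidents the new environment could prevent (counts as stated in the notes).
known_incidents, hours_per_known = 73, 1.0
new_incidents, hours_per_new = 22, 2.0
incident_hours = known_incidents * hours_per_known + new_incidents * hours_per_new

# Emergency changes: 1 every 2 working days, 1.5 hours each on average.
emergency_changes = 220 // 2
emergency_hours = emergency_changes * 1.5

incident_fte = hours_to_fte(incident_hours)
emergency_fte = hours_to_fte(emergency_hours)
total_fte = incident_fte + emergency_fte

print(f"incidents: {incident_hours:.0f} h = {incident_fte:.2f} FTE")
print(f"emergency changes: {emergency_hours:.0f} h = {emergency_fte:.2f} FTE")
print(f"total: {total_fte:.2f} FTE")
```

Keeping the hours-per-FTE constant in one place makes it easy to rerun the estimate if a different working-year definition is agreed.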
Investigate the investment needed to implement the new environment. Source: Salvatore 04/09
--> Intention is that RLSE-environment is only used by BU to do non-regression tests of all the changes and that OPS is responsible for keeping the RLSE production-like (Russo is working on procedure to maintain this)
- Launch new environment "RLSE", owned & maintained by OPS - in progress, will be launched end of 2018 (Owner: Michele Russo, stakeholder: Salvatore De Rosa)
--> This non-regression testing should be automated to ensure as much coverage as possible (Possible with HPALM?)
--> Because of automated deployment with Endeavor the extra deployment time needed is 10min/change*nr of changes
- Solution is to use same process to deploy in production as for implementing changes in UAT. Limit access for BU. Source Salvatore De Rosa
De Rosa: we are responsible for RLSE and Prod. The new environment adds 10 min of work per change. A checklist currently exists to perform test coordination (Daniele Manzotti for webservices / Ezio frgierio for batches)
- Plan meeting with Frigeni and testers in OPS to discuss automation of regression testing using Robot framework
- Additional client impact (check with Christian Putelli): Meeting on 06/09
- Obtain quantification with expert from BU about rework done by developers to fix incidents
RLSE is a potential solution, but it should be kept production-like. To be implemented end of 2018
Rossini: Cams test coordinator. Different projects, one methodology.
CI OWNER - GERT
Issue 4 - Compromised testing phase - leading to preventable incidents
Development takes longer to develop the package, so testing time gets squeezed. Testing nevertheless finds a lot of bugs, which need to be fixed by development and then tested again. This leaves less time to really test all functionalities, so only the most important functionalities are tested, leading to incidents in production.
Another consequence is that the approval timeline is violated in 80% of changes (changes have to be approved by NEXI and the CAB 2 weeks before deployment). NEXI recognizes the need for fast approval and allows late approval. The only issue is that the client behind NEXI also has to approve the change.
Potential root causes:
- Not enough time to test changes to the desired level
- Certain changes are moved up the release planning after request of NEXI. --> Source: Interview Riccardo Frigeni by Maartje
- Testing is compromised, reducing test coverage, leading to time-spent of OPS on solving preventable incidents.
--> internal incidents: suboptimal regression: 30% of all 490 incidents in H1/2018: 147 incidents x 0.5 hours/incident = 73.5 hours/semester x 2 = 147 hours/year
--> external incidents: suboptimal regression: 30% of 130 incidents/semester = 37 incidents x 5 hours = 185 hours/semester x 2 = 370 hours/year
--> emergency changes due to incident: 30% of 66 emergency changes/year = 32 x 1 hour = 32 hours/year
--> Total impact: 147 hours + 370 hours + 32hours = 549 hours/year = 0.33FTE
--> Source: Interview Salvatore De Rosa/Daniele Manzotti & Daniele Manzotti 07/09
- Impact confirmed by project manager. They sacrifice time for testing (not testing certain minor functionalities of change) --> Source: Interview Gianmario Felicetti 04/09
- SLA penalties (Check with ODM: Christian Putelli: SLA overview will be mailed by Putelli in week 37, Wouter follows up)
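The total impact for Issue 4 can be re-derived as follows. This is a sketch only: the dictionary layout and the 1,650 hours/year FTE definition (220 working days x 7.5 hours/day) are assumptions, and the incident and change counts are taken as recorded in the notes rather than recomputed from the 30% shares.

```python
# Assumption: 1 FTE = 220 working days x 7.5 hours/day = 1650 hours/year.
HOURS_PER_FTE_YEAR = 220 * 7.5

# name -> (count per period, hours each, periods per year), as recorded in the notes.
impact_items = {
    "internal incidents": (147, 0.5, 2),  # per semester
    "external incidents": (37, 5.0, 2),   # per semester
    "emergency changes": (32, 1.0, 1),    # per year
}

# Yearly hours per impact item.
yearly_hours = {
    name: count * hours_each * periods
    for name, (count, hours_each, periods) in impact_items.items()
}

total_hours = sum(yearly_hours.values())
total_fte = total_hours / HOURS_PER_FTE_YEAR

for name, hours in yearly_hours.items():
    print(f"{name}: {hours:.0f} hours/year")
print(f"total: {total_hours:.0f} hours/year = {total_fte:.2f} FTE")
```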
Increasing automation of part of (integration) testing, while still maintaining high quality of testing and being able to maintain the SLA
Test automation will allow us to test everything and prevent incidents in production (below needs clarification DMB 05/09)
CI OWNER GERT
- Project management uses an intake document. (More information on SLA from project managers; request Tomasso Tortorra.)
- Plan session with Vernon Crabtree and testers from BU and OPS to discuss automation
Issue: Rework resulting from limited involvement from OPS
Resolving preventable incidents, due to limited involvement of BO1 in the:
a. analysis phase (provide input how solutions can be made without disrupting the service)
b. test phase (provide input which kind of test should be done).
Result: low-quality software and little knowledge of what is implemented, resulting in rework.
Root cause:
- Limited involvement of OPS, starting from analysis, during build and test phase
- BU has limited technical knowledge/ skills of RUN to provide correct information
- Root cause is not the quality of software from changes, but technical debt on the environment of NEXI. For example, the SAANA project has no incidents. Source: release coordinator Michele Russo
Source 11/09 (Alessandra Rosanigo): the test coverage is determined outside the development phase, at the intake phase. (Only the client testing is determined in the intake phase. Internal testing is determined by analysts.)
- 490 job abend incidents in 2018 (source: report Salvatore de Rosa), about 30 minutes spent per issue, mostly at night (source: Daniele Manzotti 0709). Dev impact?
- 130 external incidents in past 12 months (source: report Marcel Doornink, Daniele Manzotti 0709). Time spent between 1 hour & 2 weeks for OPS. Dev impact?
- 66 emergency changes in 2018 to fix software (source: Salvatore), 1 hour each for OPS (source: Daniele Manzotti 0709). Impact for Development?
- Low fix rate in BO1 (40%); 60% needs to be solved by development. The 60% consists of: report requests, configuration changes, incident requests, root causes for "job abend" messages. Source: Salvatore
- involve OPS in agile development team of project Aurora. Project Aurora is working in Agile mode.
- Short term:
1 Involve OPS in Refinement meeting of Aurora & include OR items in product backlog. Source: Christina Trevisson
2 Set-up Devops pilot for application Securaza. -> awaiting confirmation of Alessandra with Corado Pelloli
- medium / long term: align DevOps solution with Agile team working on normal changes in BU Issueing (Alessandra)
CI Owner: Wouter
Related:
- Align OPS representative for Aurora project
- Set-up Definition of Ready & Done for Aurora project. (Done)
- Interview OPS employee on Securaza project (Done)
- Identify gain for Securaza application
- Propose plan for pilot (Scrummaster!)
(CI) (Maartje) Ensure impact from operational readiness; to avoid double counting, look up in RSB. After Karin Duijnkerke: incidents related to changes.
Background information:
Source: 3/9 (MDL): (TM) Riccardo: OBA’s value add is limited, since they don’t contribute to the HIA – direct communication between BO1 and analyst would be faster
Source 1/8: (MDL): (PM) Simona & Carlo: The OBA coordinates with operations for operational requirements & impact analyses. The issue is mostly in the small/medium changes; when a change is placed in a project the alignment is OK.
Source 03/09 (DML): (RC + SA): Involve operations in (3) monthly meetings regarding release planning, content and issues, and involve them in all the decisions (calendar is set): one meeting to challenge the regression tests (release non-regression test definition meeting), one meeting before the release (release closure meeting), and one retrospective meeting after the release (release retrospective meeting)
--> Issue 3: Additional time spent to retrieve information
Additional time is required to retrieve missing information from changes and service requests. When Operations is required to put something in production, the necessary information is missing, incomplete, or inconveniently located in the change in HPSM. The team has to chase multiple people to retrieve the necessary information.
Information is not in a standard format in the changes and is sometimes incomplete. Fabio showed the tasks he works on. The information is attached to the change. When the DIA is included it has the right information (sometimes incomplete).
In the design phase there is not enough technical knowledge about the application, resulting in limited technical information. (Source Tomasso Tortorra 0609)
Source Fabio Ruggieri 0309: 25% of each employee's time (times 6 people) is spent on retrieving information from changes & service requests. 56 tasks in HPSM in 2018 for the team.
Biweekly meeting with Fulvio's team to discuss incidents, problems and changes
Solution: agree on process to deliver steps for OPS to implement in a standard & consistent way.
(The Installation template cannot cover this as it is, as Smart Payments is not part of the applications defined.)
Plan meeting with PM:Rita Neroni, Giancarlo Valente
OBA Source (MDL): Tommaso (3/8): the Installation Instruction change template facilitates the handover from test to deployment, to prevent late involvement and lack of information transfer. The document does not cover Smart Payments.
Check overlap with Operational Readiness
Suboptimal collection of requirements for onboarding new customer or new service on Smart Payment (Sepa payments)
Potential root cause:
Lack of complete requirements
Impact:
In 40% of changes the operational task accompanying a change is not created in HPSM.
In 60% of changes there is a task but the attachment is not a final document and the PM needs to get this information from supplier or client.
Number changes/year: 100 --> 20-30 changes in smart payments
Request Fabio Ruggieri for time lost due to this + incidents handled
20-25% of time is spent by 6 persons on asking for additional information; this should be 5%
--> 1.2FTE (To be validated)
Potential Solution:
- Automated workflow tool: 3-4 years ago a tool was investigated, but due to people leaving the company this was abandoned --> Ask Maurizio Chiametti from COA about this.
- Reachtable request (request that happens periodically) when new client existing service, existing client with service request -> does not cover complete impact solution:
How much % of requests does reachtable request make up?
Frequency around 1-10 new clients per month
Check with Fabio on the time spent on onboarding a client.
High-over process steps to onboard, plus time saved
--> Meeting with Fabio Ruggieri planned 06-09.
From incident report: incidents related to onboarding clients due to missing information.
--> Request to Karin for incident report parked until week 39
Check with Chiametti for potential solution
--> Meeting to be planned Bartosz
Questions: What is the procedure of who can contact the client?
• Is there a quality gate, who reviews the above checklist (or alternative source), who approves it?
No checklist, analyst uses experience to get the correct requirements.
How much time do you spend on collecting missing information from the initial requirements? (Bartosz)
Issue 2: OPS. Source: Sub-optimal approval process for changes for NEXI. The client should have the information about the change 2 weeks in advance.
DEV. Source (MDL):
- Decision to ‘go-live’ starts at least 15 days before release, first the decision passes the customer (who is also expected to provide the PM with the (acceptance) test results), second, the decision passes the CAB. If both are in agreement: it’s a go. source Fulvio (4/9)
Information about changes is not always complete in time
Impact:
Delays in delivery in production. How can this issue be quantified?
Possible Solution:
Check with Fabio, Matteo, and Christian for issue understanding
- Check what the consequence is of not adhering to the procedure (penalties, delays)
- The Installation Instruction Template is not used by Matteo's team; reason unknown.
RfC process (OBA 3/8)
- An HPSM ticket is created by the customer (traceable throughout the process)
- The intake team evaluates the content of the ticket on completeness, quality & size –
- The intake team connects with the customer to ensure best functional solution
- A functional analyst continues to work with the customer to define the functional solution – high
- The HIA/DIA is transferred to all the teams impacted by the solution
- The functional analyst collects the information from development, testing teams etc.
- The OBA opens/receives a task in HPSM (with the DIA/HIA attached) and involves all operational teams
- In case input cannot be provided by operations (almost never happens) the OBA uses
handling of task on Mainframe batch scheduling not correctly requested
The team needs to adjust configuration based on changes. To know when to do this, the team needs to be informed via a "task" in HPSM, which is related to the complete change. The issue is that the DEV department sends the request not via a task but via an unrelated service request. Due to this incorrect format the team is not alerted in time about the change. Source: Fabio Ruggieri
Issue not recognised by two other team members (source: Ivan & Demetre)
Not yet known; a request has been made to BU to follow the process
Time wasted to solve confusion about which tasks have to be picked up by who. Delays due to missed requests.
Define the impact
incorporate in Operational Readiness Process
Contacts Project Management: Rita Neroni, Giancarlo Valente
Verified with Matteo 06/09 - discrepancy between team members
That the issue is not recognized by everyone could be explained by differences in experience and knowledge between team members
Issue 1 additional time-spent on service requests
Much time spent on service requests with additional time spent on searching for information, and having many requests about the process.
Source: Matteo Agosta interview 03/09
No structural process/filter in place in request intake from NEXI. In the past there were rules from NEXI on how to answer specifically to requests. Currently no SLAs between eWL and NEXI
New contract negotiations where NEXI stopped paying extra for this service
Low quality of delivered services, a lot of time spent on answering service requests.
250 service requests coming in every month.
Retrieve data on the amount of service requests & time spent on it.
Check if there are categories / priorities?
Automation options (template, chatbot, portal) standard form for intake process of the service request
Train NEXI on the most common requests. This was done 2-3 years ago and was successful (source Christian Putelli 06/09)
--> Parked due to low impact + revenue generating nature of service requests
Issue 5: Additional time spent supporting testers from BU
Currently an excessive amount of time is spent supporting testers from BU in regression testing and testers from NEXI in acceptance testing. Support can be: asking for missing configuration requirements (PM is responsible for this, NEXI), solving bugs found in acceptance testing, ...
Source: Fabio Ruggieri 03/09
Potential root cause:
- Lack of clear roles and responsibilities
- Limited knowledge with testers in BU
- Lack of clear test scripts/plans
- Lack of automation in testing
6 people in the team each spend around 25% of their time (1-2 hours/day) investigating the bugs; this should be reduced to 5% in the ideal situation. 1.5 hours/day x 6 = 9 hours/day = 45 hours/week; 45/37.5 = 1.2FTE currently spent on testing while they are not responsible for it. After the shift of testing from BU to OPS, the actual gain can be calculated
Source: Interview Ruggieri 10/09
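The FTE figures in this impact estimate come from a daily-hours calculation. The sketch below reproduces the team figure and applies the same formula to the team manager's 1.5 hours/day; the helper name and the 5-day working week are assumptions, while the 37.5-hour week is the figure used in these notes.

```python
HOURS_PER_WEEK = 37.5  # full-time week used in the notes
DAYS_PER_WEEK = 5      # assumption: 5 working days per week


def weekly_fte(hours_per_day: float, people: int = 1) -> float:
    """FTE represented by a daily time spend across a number of people (hypothetical helper)."""
    return hours_per_day * people * DAYS_PER_WEEK / HOURS_PER_WEEK


team_fte = weekly_fte(1.5, people=6)  # 6 people supporting testers
manager_fte = weekly_fte(1.5)         # team manager on operational activities

print(f"team: {team_fte:.1f} FTE")
print(f"team manager: {manager_fte:.1f} FTE")
```

These are current time spends, not gains; the gain depends on the target level (5% and 0% respectively) and on the outcome of shifting testing from BU to OPS.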
25% of the team manager's time is spent on operational activities; this should be reduced to 0% (1.5 hours/day = 7.5 hours/week: 7.5/37.5 = 0.2FTE).
Potential solutions:
- Shift responsibility for regression testing and acceptance testing to OPS --> Possible, but more resources are needed to do the work. BU payments is also willing to shift regression testing to OPS. Source: Graziano Magnani 10/09
- Automate the regression testing
- Clear R&R matrix
- Clear documentation structure in Wiki or Confluence, need for having only limited admins so the files do not move
- Include OPS remarks in test plan
Fulvio hired 2 test specialists to support test automation, and a new colleague in production gives information to testers and gains knowledge from the testers' point of view (cross-skilling). Source: Matteo Agosta 06/09
CI Owner Gert
- Investigate current R&R concerning solving bugs in testing with BU.
- Deep dive in process of regression and acceptance testing with test expert from BU: Graziano Magnani
--> Question: How much time do the testers in BU spend on testing? How many testers are there? Is there a current R&R regarding testing practices?
Align with Fulvio Longoni if he agrees with shifting resp. testing from BU to OPS
Align with Matteo Agosta if he agrees to shifting resp. testing from BU to OPS
(meeting 06/09 at 10h30)
issue # time allocation from IAM team manager on managing client with regard to incidents
potential root cause
- ODM (Christian Putelli and Clevis) takes the lead in managing the client relationship with regard to incidents concerning Technical Engineering (system management), performing the following activities: completing incident reports, providing statistics, facilitating client meetings. The team manager has a contributing role
- The team manager takes the lead in managing the client relationship with regard to incidents concerning the production center (business, cards and payments), performing the following activities: completing incident reports, providing statistics, facilitating client meetings. ODM has a contributing role
source Matteo Agosta 0609
demarcation of responsibility between ODM and team manager on this topic not fully clear
impact
on average estimated 1/2 day per week of team manager's time spent on above mentioned activities
(estimated based on time spent on 35 incidents in Q2, 9 incidents in Q3)
Potential solution
time spent by team manager could be moved to SDM for Italian perimeter
approach
- align with Christian Putelli to get his view on this
- Coverage for Italy included in SDM implementation for PCAP in collaboration with Eric Lenz - validation required with Maartje, DMB 0609