Applying Root Cause Analysis (RCA) to Business Continuity

By Stacy Gardner, Avalution Consulting
Article originally posted on Avalution Consulting’s Blog

Though many business continuity standards emphasize the importance of tracking corrective actions to address identified issues, the recently published ISO 22301 (and previously BS 25999-2) also requires conducting a root cause analysis – looking not just at an issue, but at its cause and how it can be prevented in the future.  Root cause analysis (RCA) seeks to prevent recurrence of an adverse event or system failure by tracing the causal chain back to its most likely origin, then putting measures in place to mitigate the underlying causes.  While common in disciplines that deal with extreme precision and protection of life (e.g. quality and environmental health and safety), there’s no reason the business continuity discipline cannot benefit from a similar approach, particularly for practitioners looking to fully implement ISO 22301.  This article explains root cause analysis and identifies how organizations can benefit from implementing the concept in a business continuity context.

The “Five Whys” technique at the heart of root cause analysis is generally credited to Sakichi Toyoda (founder of Toyota Industries, from which Toyota Motor Corporation later emerged), who developed it to understand potential causes for problems beyond what was immediately obvious.  Root cause analysis became more formalized as it was integrated into several different fields as a performance driver, such as safety, quality, operations and information security.  In each of these areas, reactively responding to an issue was not enough – future issues needed to be prevented, and root cause analysis was the path to improved performance and risk mitigation by eliminating true causes, rather than just symptoms.  Incorporating root cause analysis into existing business continuity-related corrective action efforts could very well reduce the likelihood of future disruptive incidents and decrease recovery times.

At times, performing RCA is as easy as implementing the five whys, repeatedly asking “why” something occurred until it seems like you’ve reached the baseline cause of how failure occurred.  The key is a disciplined application of asking probing questions.  For example, analyzing the root cause of why an organization failed to meet a 24-hour recovery time objective for its SAP environment during a recent test could look something like this:

  1. Problem: IT recovery personnel failed to recover the organization’s SAP system within its recovery time objective of 24 hours during last week’s IT DR test … Why?
  2. IT recovery personnel said that SAN LUNs were not mapped correctly, which drastically delayed the start of restoration from disk … Why?
  3. Vendor personnel responsible for prepping the equipment failed to execute the setup specifically to documented expectations … Why?
  4. Vendor personnel indicated that the instructions seemed contradictory and did not provide the level of detail necessary to execute steps, so they used a basic default setup … Why?
  5. Upon analysis, documentation did leave out several crucial steps necessary to enable this complex LUN mapping to occur … Why was this not found earlier?
  6. When performing previous testing, personnel did not fully leverage existing plan documentation … What changed this time?
  7. The individual responsible for documenting the plan and performing past testing was unavailable, and personnel who performed testing this time indicated they were not properly trained on use of the plans, nor were they instructed on how to escalate issues regarding recovery processes.

Although it might seem the root cause was reached, simply fixing the documentation does not ensure future documentation will be accurate.  Taking it deeper, the IT subject matter expert previously responsible for documenting the procedures often performed onsite testing without using documentation, as he had extensive experience in this field and felt he could recover more quickly from experience than from documented procedures.  Exploring the issue further revealed that newer personnel assigned to recovery tasks were far less experienced and had not yet received an appropriate level of awareness training.  Related to this point, the IT Director admitted he never required other personnel to validate documentation, as testing takes time away from production support and leveraging the “experts” in each phase lessens testing time.

Part of the solution to this could be to implement an expectation that all documented procedures be validated at least annually by another IT individual within a different area of expertise.  A second part of the solution could be to perform appropriate training up front (that emphasizes familiarity with plans and knowledge of escalation procedures) for both alternate internal individuals and any vendor resources responsible for plan execution.  Together, these efforts could help assure that all IT DR documentation can be effectively used by both internal and external resources during testing.

Although simple in theory, identifying the actual root cause and figuring out when you’ve gone far enough can be complex in practice.  To help understand primary root causes, you must repeatedly ask variants of “why” (and a few other probing questions), then look for the answer that seems most likely to have influenced the issue.  While there may not be a “hard science” to root cause analysis, the deeper you look for causes, the more likely you are to find issues to resolve.  In most cases, the biggest issue most organizations face is not exploring problems in the first place!  Our example demonstrated this problem in the recovery of SAP.  However, it’s likely this problem (the shortcuts) exists in other areas, and addressing the root cause could improve performance and recoverability elsewhere.
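
The questioning chain from the SAP example can be written down as a simple data structure, which makes it easier to review with a group and to see where the probing stopped. The following Python sketch is only an illustration: the `WhyStep` class and `render_chain` function are hypothetical names, and the findings are shortened paraphrases of the seven steps above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WhyStep:
    """One link in a Five Whys chain (hypothetical structure)."""
    finding: str          # what was observed or reported at this level
    follow_up: str = ""   # the probing question asked next; empty if none was asked


def render_chain(steps: List[WhyStep]) -> List[str]:
    """Number each finding; a step with no follow-up question is flagged
    as a candidate root cause."""
    lines = []
    for depth, step in enumerate(steps, start=1):
        if step.follow_up:
            lines.append(f"{depth}. {step.finding} -> {step.follow_up}")
        else:
            lines.append(f"{depth}. {step.finding} (candidate root cause)")
    return lines


# Shortened paraphrase of the SAP recovery example from the article.
sap_chain = [
    WhyStep("SAP not recovered within the 24-hour RTO", "Why?"),
    WhyStep("SAN LUNs were not mapped correctly", "Why?"),
    WhyStep("Vendor setup did not follow documented expectations", "Why?"),
    WhyStep("Instructions seemed contradictory and lacked detail", "Why?"),
    WhyStep("Documentation left out crucial steps", "Why was this not found earlier?"),
    WhyStep("Previous tests did not fully use the plan documentation", "What changed this time?"),
    WhyStep("Usual expert unavailable; replacement staff not trained on the plans"),
]

for line in render_chain(sap_chain):
    print(line)
```

A "candidate" label is used deliberately: as the example shows, the last answer in a chain is only the point where questioning stopped, not proof that the true root cause has been found.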

[Figure: Variants of … (image not reproduced)]

Within business continuity, several areas can commonly be identified as root causes of risk mitigation, response and recovery performance issues, although again, finding them requires tracing issues back further than most professionals choose to explore.  To properly integrate root cause analysis into continuous improvement activities, each issue should be adequately documented, including the source of the issue, a detailed description, an identification date and a field to capture the root cause analysis.  Rather than relying on one individual to identify the root cause, business continuity personnel should organize and facilitate discussions with the subject matter experts to whom issues may be assigned or who can provide insight, so that the group traces the issue back to its origin together.
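
As a minimal sketch of what such an issue record might look like, the Python below models the fields named above (source, description, identification date, root cause analysis). The class name, field names and the `participants` list are assumptions made for illustration; they are not taken from ISO 22301 or from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class IssueRecord:
    """Hypothetical corrective-action entry with a root cause field."""
    source: str                                  # e.g. test, audit, actual incident
    description: str                             # detailed description of the issue
    identified_on: date                          # identification date
    root_cause_analysis: Optional[str] = None    # filled in after group discussion
    participants: List[str] = field(default_factory=list)  # SMEs involved (assumed field)

    def needs_rca(self) -> bool:
        """True while the root cause field is still empty or blank."""
        return not (self.root_cause_analysis and self.root_cause_analysis.strip())


issue = IssueRecord(
    source="IT DR test",
    description="SAP not recovered within the 24-hour recovery time objective",
    identified_on=date(2012, 9, 1),  # illustrative date only
)
print(issue.needs_rca())  # True until the analysis is recorded

issue.root_cause_analysis = "Recovery procedures never validated by a second person"
issue.participants = ["IT recovery lead", "Vendor representative"]
print(issue.needs_rca())  # False once the field is populated
```

Keeping the root cause as an explicit field, rather than burying it in the description, makes it straightforward to list open issues that still await analysis.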

Within business continuity, there are numerous root causes that can lead to a variety of issues or complications. The following table notes a few examples, together with likely root causes, though this is far from a complete list.  Also, it’s important to note that just like with tree roots that feed a tree’s growth, there could be more than one root cause that affects a system and results in a problem, so it is important to trace all potential paths of an issue’s origin back, rather than just pursuing one direct cause, to identify all influencing factors.
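
The tree-root analogy suggests treating causes as a branching structure rather than a single chain. The sketch below, under the assumption that each problem or cause maps to a list of its contributing causes, walks every path and returns all entries that have no further cause recorded. The function name and the sample mapping (paraphrased from the SAP example) are illustrative only.

```python
from typing import Dict, List


def find_root_causes(causes: Dict[str, List[str]], problem: str) -> List[str]:
    """Follow every causal path from `problem` and return, sorted, the causes
    that have no deeper cause recorded. If `problem` itself has no recorded
    causes, it is returned as its own (unexplored) root."""
    roots = []
    seen = set()
    stack = [problem]
    while stack:
        node = stack.pop()
        if node in seen:          # guard against cycles or shared causes
            continue
        seen.add(node)
        contributing = causes.get(node, [])
        if contributing:
            stack.extend(contributing)
        else:
            roots.append(node)
    return sorted(roots)


# Paraphrased from the SAP example; two branches meet at the documentation gap.
sap_causes = {
    "RTO missed": ["LUNs mapped incorrectly"],
    "LUNs mapped incorrectly": ["Vendor used default setup"],
    "Vendor used default setup": ["Documentation incomplete"],
    "Documentation incomplete": [
        "No peer validation of procedures",
        "Recovery staff not trained on plans",
    ],
}

print(find_root_causes(sap_causes, "RTO missed"))
```

Here the single symptom traces back to two distinct root causes, which matches the two-part solution (annual validation and up-front training) proposed earlier in the article.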

[Table: Problem and Potential Root Cause (not reproduced)]

Again, root cause analysis is not just about solving one instance of a problem; it also seeks opportunities to prevent future occurrences of an issue.  Once the origin of an issue is identified, it’s important to evaluate all areas of the business to identify other at-risk areas and ensure proper risk mitigation measures are put in place.  A solution in one area may not be applicable to every other area of an organization, but even then, the act of identifying similar at-risk areas raises awareness and enables the organization to develop additional solutions that make sense and address these risks before they result in future issues or downtime.

As business continuity management systems continue to mature, root cause analysis will become a powerful tool for business continuity professionals to deeply examine the cause of issues and provide an opportunity to correct them before they occur again.

____________

Stacy Gardner, Managing Consultant
Avalution Consulting: Business Continuity Consulting

Our consulting team regularly publishes perspectives (shorter, independent articles) that touch on the trends currently affecting our profession and the strategic issues facing our clients. This is one of our most recent posts, but the full catalog of our perspectives – over 100 published since 2005 – can be accessed via our blog.

A Closer Look At: ISO 22301

I just downloaded the updated Rules and Regulations spreadsheet… To say there is a lot of great content and information in this spreadsheet would be an understatement. This Rules and Regulations spreadsheet was compiled by a team of industry experts (all members of the DRJ EAB). 

The most recent update to this resource was in August 2012, and I thought it would be a good idea to write about rules and regulations that you might not know about, that have recently been amended or added, or that you might not fully understand. (Yes, this is me urging you to post comments about which rules and regulations you would like me to investigate and write about for you!) 

For the first look at the rules and regulations that impact everyone in the BC space, this post focuses on ISO 22301. 

 ISO 22301 

Here is the short summary of ISO 22301 from the bsigroup.com website: 

ISO 22301 is the new international standard for business continuity management. It has been created in response to strong international interest in the original British Standard BS 25999-2 and other regional standards. And if you meet the requirements to gain certification, your organization will be recognized globally. 

ISO 22301 identifies the fundamentals of a business continuity management system, establishing the process, principles and terminology of business continuity management. 

It provides a basis for understanding, developing and implementing business continuity within your organization and gives you confidence in business-to-business and business-to-customer dealings. Use it to assure key stakeholders that your business is fully prepared and you can meet internal, regulatory and customer requirements. 

The standard provides organizations with a framework to ensure that they can continue operating during the most challenging and unexpected circumstances – protecting their staff, preserving their reputation and providing the ability to continue to operate and trade. 

What does this really mean? 

Essentially, this standard gives your organization the basis for identifying the threats it faces and for preparing to withstand them. With ISO 22301 you have the tools to prepare for these threats proactively. 

With this level of preparation and framework, your investors, colleagues, partners and brand have the confidence that your organization is prepared and ready to face threats and disaster head-on. 

ISO 22301 provides a formal business continuity framework and will help you to develop a business continuity plan that will keep your business running during and following a disruption. It will also minimize the impact so you can resume normal service quickly, ensuring key services and products are still delivered. 

How does it impact your business? 

We’ve written before in this space about how critical it is to be prepared for every level of threat – this includes natural disasters as well as normal day-to-day disruptions such as employee illness or loss of supply chain continuity. All of these can have a big impact on the success of your business and its ability to remain profitable. 

Certification to ISO 22301 demonstrates that you are aware of and have identified these threats. The impact on your business is that it is ready and prepared to react to threats and limit disruption. 

What do you need to tell your colleagues? 

A visit to the bsigroup.com website details a long list of benefits – so we’ll highlight a few here that stand out: 

Cost savings: You’ll have the opportunity to reduce the burden of internal and external BCM audits, improve financial performance and reduce business disruption insurance premiums.

Business improvement: Certification requires a clear understanding of your entire organization, which can identify opportunities for improvement.

Continuous improvement: The certification process involves regular audits that ensure your management system is up to date.

Maximize quality and efficiency: ISO 22301 provides a framework based on international best practice, built around the ‘Plan, Do, Check, Act’ concept.

As you know there is a very long list of reasons why your business needs to adhere to rules and regulations – and each rule and regulation has its own benefits. 

What is interesting with ISO 22301 is the impact it has on BS 25999-2: 

  • BS 25999-2 has been superseded by ISO 22301. 
  • BS 25999-2 will be withdrawn on November 1, 2012. 
  • Businesses can make a transition from BS 25999-2 to ISO 22301. 
  • BS 25999-2 certification remains valid during the transition to ISO 22301. 
  • Certifications and renewals for BS 25999-2 will end after May 2014. 

Next steps? 

Now that you have the basics of this new standard, it is time to sit down and really review the website, watch the webinars, and send your questions to [email protected] 

Make sure you review the recently updated DR Rules and Regulations spreadsheet – you can use this spreadsheet to quickly compare these rules and regulations and easily access more information. (And don’t forget to respond to this post and let us know about the rules and regulations you’d like us to take a closer look at.)

How to Establish Risk Appetite in the Context of Business Continuity

By Brian Zawada & Jacque Rupert, Avalution Consulting
Article originally posted on Avalution Consulting’s Blog

The introduction of ISO 22301 (Societal security – Business continuity management systems – Requirements) more closely aligns business continuity with the broader risk management discipline. A key contributor to this alignment is the standard’s requirement to understand the organization’s “risk appetite” (a term not used in BS 25999).

ISO 22301’s definition of risk appetite (Section 3.49) is the “amount and type of risk that an organization is willing to pursue or retain”. The standard makes reference to risk appetite in two sections:

[Figure: ISO 22301 and Risk Appetite (image not reproduced)]

In addition, the authors of the guidance document supporting ISO 22301, titled ISO DIS 22313, make one further reference to risk appetite in the section focused on establishing the context for the business continuity management system:

[Figure: ISO 22301 and Risk Appetite (image not reproduced)]

For those seeking alignment with or certification to ISO 22301, business continuity professionals (or those charged with business continuity planning) should understand the concept of risk appetite and address the requirements outlined above.

Please note: the goal of this article is not to provide a comprehensive, theoretical understanding of risk appetite, as other whitepapers and information sources already do this, but rather to introduce the concept to business continuity professionals and offer insight on leveraging and “implementing” this concept in our profession.

The Relationship Between Risk Appetite and Business Continuity
We believe the contributors to ISO 22301 integrated the concept of risk appetite (“amount and type of risk that an organization is willing to pursue or retain”) into a business continuity management program standard for two key reasons:

  1. Organizations should view risk appetite as all-encompassing, incorporating all areas of risk, including the business continuity-related risks associated with disruptive incidents; and
  2. Using risk appetite to adequately scope and support a business continuity management system helps align business continuity to organizational strategy and other risk management efforts, enabling business continuity to better integrate into broader risk management.

Further, when done effectively, risk appetite becomes a key input to (and may overlap considerably with) a business continuity management system’s scope and objectives.

Keys to Determining Risk Appetite
As noted above, many sources of information are available that describe the concept of risk appetite and the best approach for determining an organization’s risk appetite. Avalution analyzed these sources to better understand how to most appropriately assist our clients in determining and documenting their risk appetites as they pertain to business continuity planning, as well as to integrate the concept into our own business continuity system (since we are actively transitioning from BS 25999-2 to ISO 22301 within our organization). One of the most valuable sources we identified is a white paper published by the Institute of Risk Management (IRM), which introduced a number of “design” factors the authors considered key to determining risk appetite. Three of these design factors, or considerations, are paraphrased below; we found they help to better understand and determine risk appetite:

  1. An organization’s risk appetite is – or should be – measurable
  2. The acceptability of risk must have a time (temporal) consideration, to ensure periodic assessment (given organizational and environmental change)
  3. Risk acceptance should not have anything to do with relaxing controls (risk treatments)

With this said, and in our opinion, some of the sources of information – other than executive management – that organizations should review when determining risk appetite include:

  • Annual reports and financial statements
  • Customer contracts
  • Regulatory requirements
  • Business strategic plans
  • Marketing materials
  • Board meeting minutes

Although we will not go into further detail on determining risk appetite, those looking for additional information should consider reviewing the following:

  • COSO – Understanding and Communicating Risk Appetite
  • ERM Symposium – Cremonino
  • Towers Perrin – ERM Risk Appetite
  • COSO – ERM Executive Summary

Example – Risk Appetite at Avalution
In transitioning from BS 25999-2 to ISO 22301, we had to understand how risk appetite pertains to our business continuity management system, given that this is a new formalized requirement necessary for certification. Using the guidance and approach described in the previous section of this article, we documented our risk appetite summary as follows:

In 2012, we are willing to tolerate a finite amount of downtime as long as it does not result in the following:

  1. Damaged reputation among our clients that leads to broader, negative market perception
  2. Missed service level agreements specific to The Planning Portal and BC Catalyst
  3. Financial loss in excess of $50,000
  4. Project delays of more than three days due to resource disruption and lost data

In order to align our existing business continuity system with this statement regarding risk appetite, Avalution management intends to staff and appropriately resource our business continuity management program to minimize downtime in the most effective, pragmatic manner possible.

As noted earlier in this article, this statement aligns with the IRM design considerations, specifically:

  • It aligns to our products and services, as well as our organization’s strategic priorities, and thus the scope of our business continuity management program
  • It provides quantifiable ways to measure risk
  • It notes a time element (2012)
  • It notes where our management team accepts a level of risk, which frees resources to improve our business, services and technology, as well as invest in our people
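
Because the 2012 statement above is measurable, it can be expressed as a simple check. The following Python sketch is an assumption-laden illustration, not Avalution's actual tooling: the class, field and function names are hypothetical, and the two qualitative criteria (reputation and service level agreements) are reduced to yes/no flags that someone would have to judge.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the 2012 risk appetite statement in the article.
MAX_FINANCIAL_LOSS_USD = 50_000
MAX_PROJECT_DELAY_DAYS = 3


@dataclass
class IncidentImpact:
    """Hypothetical summary of one disruptive incident's estimated impact."""
    reputation_damaged: bool      # judged: broader negative market perception?
    sla_missed: bool              # judged: product service level agreement missed?
    financial_loss_usd: float
    project_delay_days: float


def appetite_breaches(impact: IncidentImpact) -> List[str]:
    """Return the criteria of the risk appetite statement that the impact exceeds."""
    breaches = []
    if impact.reputation_damaged:
        breaches.append("reputation")
    if impact.sla_missed:
        breaches.append("service level agreement")
    if impact.financial_loss_usd > MAX_FINANCIAL_LOSS_USD:
        breaches.append("financial loss")
    if impact.project_delay_days > MAX_PROJECT_DELAY_DAYS:
        breaches.append("project delay")
    return breaches


tolerable = IncidentImpact(False, False, financial_loss_usd=12_000, project_delay_days=1)
excessive = IncidentImpact(False, False, financial_loss_usd=80_000, project_delay_days=5)

print(appetite_breaches(tolerable))   # no criteria exceeded
print(appetite_breaches(excessive))   # financial loss and project delay exceeded
```

An empty result means the downtime falls within what management has said it will tolerate; any entry in the list points to where continuity resources should be focused.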

Conclusions
Risk appetite is an important concept that involves strategic, operational and tactical elements – all of which influence the successful implementation and continual improvement of a business continuity management program. Considering risk appetite as part of business continuity planning allows business continuity to more closely align with risk management efforts, enabling business continuity efforts to focus primarily on the risks management is unwilling to accept regarding critical products, services, business processes and resources (all of which an organization should clearly document within its risk appetite). Understanding the boundaries – based on an acceptable level of risk – introduces focus and clarity in planning, which results in higher levels of effectiveness and efficiency in protecting an organization’s most time-sensitive or critical activities.

Further, considering risk appetite in the context of business continuity planning should help management frame business continuity in relation to how they already think about the broader topic of risks to the organization, with the risk of disruptive incidents being just one factor to consider. Aligning the business continuity effort to how management already thinks (on a strategic level) should contribute to a stronger, clearer value proposition for the preparedness effort, which should enable long-term support and management involvement.

Due to the benefits outlined throughout this article, Avalution believes that the concept of risk appetite is a welcome addition to ISO 22301, and one that business continuity professionals should learn more about in order to be an active participant in a broader risk management effort.

________________________

Brian Zawada, Director of Consulting & Jacque Rupert, Managing Consultant
Avalution Consulting: Business Continuity Consulting
