Water systems are essential for health, safety, and well-being. As such, they are considered critical infrastructures (CIs) (
McPherson and Burian 2005) whose disruption of service can have significant debilitating impacts. Contemporary water CIs are moving toward cyber and physical integration, merging processes with computational systems to form cyber-physical systems (CPSs) (
Lee 2008). Water CIs inherit larger attack surfaces (
Howard et al. 2005) from the entanglement of cyber and physical layers, which places additional pressure on their strategic and tactical planning. Advanced cyberattacks are designed to reach the physical domain through communication and/or computational infrastructures, thereby evolving into cyber-physical threats. New functionalities for the autonomous operation of subsystems, real-time monitoring, and remote control, all designed to increase efficiency, are becoming sources of risk that adversaries can exploit to disturb or even weaponize water supplies (
Janke et al. 2014). In this era, enhancing data-driven emergency preparedness and planning, to better comprehend and manage emerging risks, helps ensure safe and resilient water systems for communities (
Ugarelli et al. 2018).
Cyber-Physical Vectors in Risk Management
The intertwining of cyber and physical layers arguably increases efficiency and accuracy by offering capabilities like remote real-time control (RRTC) for pressure management (
Giustolisi et al. 2017;
Page et al. 2017), mitigation of combined sewer overflow (CSO) (
Garofalo et al. 2017) or extension of actuators’ lifespan (
Lund et al. 2018), and the detection of contamination (
Wang et al. 2015) or leakages at the household level (
Kossieris et al. 2014). On the other hand, just as water CIs benefit from shifting to more integrated CPSs, so do potential adversaries by constantly adjusting their tactics, techniques, and procedures (TTPs) (
Johnson et al. 2016) to exploit the new cyber-physical domain. The European Union’s (EU) Agency for Network and Information Security (ENISA) has reported a shift in the threat landscape from individuals to companies (
ENISA 2019), while for the same period the annual strategic report of the European Cybercrime Center (EC3) identifies the convergence of cyber and terrorism (
Europol 2018). Access to a range of malware, anonymization and encryption tools, and related services through the Darknet enables even inexperienced threat actors to exploit vulnerabilities and perform cyber-physical attacks (CPAs) that go well beyond their actual know-how and skills. Common misconceptions about the security of industrial control systems (
Loukas 2015), broad geographical expansion of CPSs (
Konstantinou et al. 2015), and a rise in sophistication (
Rasekh et al. 2016) of malicious code allow for a range of manipulation and deception attacks.
According to the latest Verizon Data Breach Investigations Report (DBIR) (
Verizon 2019), 23% of the reported breaches involved nation-state or state-affiliated actors and 28% leveraged malware to establish or advance attacks. In line with those findings, a technical alert by US-CERT (
US-CERT 2018) revealed that since at least March 2016 multiple US CIs, including water-sector CIs, were strategically targeted by foreign government cyber actors who, inter alia, gained access to industrial control systems (ICSs). Compromised ICSs and unauthorized access over such systems can go undetected for long periods, exposing water CIs and society to significant risks. Such is the case of a company responsible for supplying a number of neighboring counties, anonymized under the pseudonym “Kemuri Water Company” (
Verizon 2016), where remotely controlled assets operated abnormally for nearly 2 months. Attackers accessed the platform that supervised hundreds of programmable logic controllers (PLCs), gaining control over and altering both the dosing of water treatment chemicals and the water flow itself, thereby compromising supply services. Another example from recent water-sector CPA history is the 2013 near-miss that followed a successful supervisory control and data acquisition (SCADA) hack of the Bowman Avenue Dam in New York. As made public by the relevant indictment, the attacker, who had repeatedly obtained data on water levels, temperature, and sluice status, had also gained access to the sluice remote control system. The attack did not escalate only because the dam’s sluice had been disconnected for maintenance. A recent review of the sector’s incidents by Hassanzadeh et al. (
2020) reveals the diversity of attackers’ TTPs and resulting consequences. Officially disclosed or otherwise, recent incidents or near-misses in water CIs have raised a caution flag that should not be overlooked.
Proactive risk management calls for prevention and preparedness, through structured multidisciplinary approaches for stress testing (
Galbusera et al. 2014;
Licák 2006), against current and future threats. Both the EU [CD 2008/114/EC, article 2(c)] and the USA (
US Department of Homeland Security 2009) critical infrastructure protection (CIP) frameworks recommend risk assessments that follow the threat scenario approach. Threat scenarios, considered the drivers of emergency simulations (
Grance et al. 2006), act as catalysts that trigger the exploration of infrastructure exposure to risk and inspire actions to protect against potential threats, up to weaponization of supply (
ASME-ITI 2009). To meet those objectives, cyber-physical threat scenarios must address key threat characteristics and explore multiple durations and escalations of events under existing operating plans and available alternatives (
Bouchon et al. 2008), while representing the adversary’s TTPs.
Toward Realistic Stress Testing
Stress tests are risk and safety assessment approaches designed to associate the severity of a threat scenario with its impact on the system or society, performing the core analysis required for the prevention of risks and the preparedness of the CI (
Galbusera et al. 2014). Physical, cyber, geographical, and logical interdependencies within a system, as defined by Rinaldi et al. (
2001), allow for cascading effects to occur. For water CIs to prepare against events that may cascade from the cyber to the physical layer and vice versa, appropriate stress-testing environments are required that can model those dynamics (
Nikolopoulos et al. 2018). This has triggered cyber-oriented research on developing virtual SCADA environments and testing vulnerabilities (
Almalawi et al. 2013;
Chen et al. 2015;
Davis et al. 2006;
Fovino et al. 2010;
Queiroz et al. 2011;
Siaterlis et al. 2013). In the water CI domain, the widely accepted EPANET model (
Rossman 2000), used to simulate hydraulic systems, has recently started transforming toward bridging that cyber-physical gap. Eliades et al. (
2016) provided a programming interface to the simulator through MATLAB to assist research in the field of smart water networks, which was used by Taormina et al. (
2018) to develop the EpanetCPA toolbox, which links monitoring and control device interactions to traditional network hydraulics. The toolbox provides, inter alia, a structured way of importing CPA scenarios and passing them, at a certain level of modeling abstraction, through a hydraulic solver. In a similar manner, Klise et al. (
2017) developed an open-source Python software package, the Water Network Tool for Resilience (WNTR, version 0.2.2), which employs both EPANET and a purpose-built EPANET-based simulator to model and simulate water distribution networks (WDNs), with a focus on network resilience in physical emergency states (e.g., earthquake, power outage). WNTR has recently been used in the work of Nikolopoulos et al. (
2019a) in an early prototype of a cyber-physical stress-testing platform called RISKNOUGHT.
Stress testing introduces distributed or point loads that cause performance to drift outside normal boundaries and lead to nonideal conditions of service. Events, like an attack or power outage at a pumping station, can lead to pressure deficiency conditions, in which demand-driven analysis (DDA) solvers, such as the original EPANET solver, pose limitations (
Chmielewski et al. 2016). DDA solvers continuously supply nodes regardless of the pressure, yielding unrealistic demand satisfaction and hydraulic behavior. Quality of generated data is directly linked to the simulation approaches and methods chosen (
Wand and Wang 1996), and as such, DDA is not suitable for the purposes of a stress test. On the other hand, linking pressure to nodal outflow allows for pressure-driven analysis (PDA) through nodal head-flow relationship (NHFR) formulas (e.g.,
Fujiwara and Li 1998;
Germanopoulos 1985;
Wagner et al. 1988). Under these formulas, demand is fully satisfied when pressure is adequate, and the delivered share decreases as pressure drops toward a minimum operating value. Because water supply is based on the available operating pressure at each node, PDA-based stress-testing platforms can represent pressure deficiency effects more realistically (
Todini 2003).
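The contrast between DDA and PDA can be illustrated with a minimal Python sketch of one commonly used NHFR, the square-root form attributed to Wagner et al. (1988). The function names, the pressure thresholds (`p_min`, `p_req`), and the example values are assumptions made for illustration only; they are not taken from any of the tools discussed above.

```python
import math


def dda_delivered_demand(required_demand, pressure):
    """Demand-driven assumption: the full demand is delivered at any pressure."""
    return required_demand


def wagner_delivered_demand(required_demand, pressure, p_min=0.0, p_req=20.0):
    """Pressure-driven delivery using the square-root NHFR (Wagner et al. 1988).

    p_min: pressure at or below which no water is delivered (assumed value)
    p_req: pressure at or above which demand is fully met (assumed value)
    """
    if p_req <= p_min:
        raise ValueError("p_req must be greater than p_min")
    if pressure <= p_min:
        return 0.0
    if pressure >= p_req:
        return required_demand
    # Partial delivery between the two thresholds.
    fraction = math.sqrt((pressure - p_min) / (p_req - p_min))
    return required_demand * fraction


if __name__ == "__main__":
    # Hypothetical node with a demand of 10 L/s at three pressure levels (m).
    for p in (25.0, 5.0, -3.0):
        print(p, dda_delivered_demand(10.0, p), wagner_delivered_demand(10.0, p))
```

In this sketch the DDA function reports full delivery even at negative pressure, whereas the pressure-driven function reduces delivery to half at a quarter of the required pressure and to zero below the minimum.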
Combining state-of-the-art tools and methodologies that best fit the purposes of the analysis can yield a realistic cyber-physical stress-testing approach. Such an approach produces high-quality data, which must then be mined to express failure. This work provides a defined structure for interpreting a system’s predicament by translating simulation data into failure information, with the aim of enhancing risk-informed decisions and the prioritization of actions.
From Data to Performance Information
Translating model-derived data to meaningful aggregates and keeping an overview of the simulated behavior is a difficult task because of the large volume of raw data. Real water network models contain thousands of nodes and assets, dynamically operated over a simulation period. Even skeletonized network models, with known limitations and shortcomings (
Davis and Janke 2018), produce large sets of data, while fine-time-resolution simulation adds to both the detail and volume of results. Thus, making sense of stress-test results in a structured and efficient way becomes of paramount importance for facilitating risk-informed decision-making (
Hansson and Aven 2014) and can be achieved by mapping results to suitable indicators.
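As an illustration of such a mapping, the following Python sketch condenses nodal pressure time series into a handful of summary values. It is a hypothetical example rather than the framework proposed in this work: the indicator names, the 20 m pressure threshold, and the three-node data set are all assumptions.

```python
def pressure_deficiency_indicators(pressures, threshold, timestep_hours):
    """Condense nodal pressure time series into a few summary indicators.

    pressures: dict mapping node name -> list of pressures, one per time step
    threshold: pressure below which a node counts as deficient (assumed value)
    timestep_hours: length of one reporting time step, in hours
    """
    if not pressures:
        raise ValueError("pressures must contain at least one node")
    n_nodes = len(pressures)
    n_steps = len(next(iter(pressures.values())))
    if any(len(series) != n_steps for series in pressures.values()):
        raise ValueError("all nodes need the same number of time steps")

    # Hours each node spends below the threshold over the whole simulation.
    deficient_hours = {
        node: timestep_hours * sum(1 for p in series if p < threshold)
        for node, series in pressures.items()
    }
    affected_nodes = sorted(n for n, h in deficient_hours.items() if h > 0)

    # Largest share of nodes that are deficient during the same time step.
    peak_share = max(
        (
            sum(1 for series in pressures.values() if series[t] < threshold) / n_nodes
            for t in range(n_steps)
        ),
        default=0.0,
    )

    worst_node = max(deficient_hours, key=deficient_hours.get)
    return {
        "affected_nodes": affected_nodes,
        "share_of_nodes_affected": len(affected_nodes) / n_nodes,
        "peak_share_deficient": peak_share,
        "worst_node": worst_node,
        "worst_node_deficient_hours": deficient_hours[worst_node],
    }


if __name__ == "__main__":
    # Hypothetical hourly pressures (m) at three junctions.
    example = {
        "J1": [30.0, 28.0, 12.0, 25.0],
        "J2": [35.0, 18.0, 10.0, 15.0],
        "J3": [40.0, 38.0, 36.0, 39.0],
    }
    print(pressure_deficiency_indicators(example, threshold=20.0, timestep_hours=1.0))
```

Even in this toy case, twelve raw values reduce to five numbers that state which nodes were affected, how widely, and for how long; a real network would apply the same reduction to thousands of nodes and time steps.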
Water-sector experts often keep track of network service performance and communicate company goals through sets of measures designed for management, which are found to be very similar across countries and companies (
Vilanova et al. 2015). The similarity originates from the common fundamental processes, assets, and overall goals of water companies, with metrics usually focused on five main categories of management interest:
• Quality of service, which includes both quantity and quality delivered to customers;
• Asset, which includes the physical performance of the infrastructure;
• Operational, which relates to daily system monitoring and maintenance;
• Personnel, which focuses on human resources management; and
• Financial, which keeps track of the financial soundness and economic prosperity of the company.
Performance measures are metrics that quantify the efficiency or effectiveness of an action (
Neely et al. 2005). These are categorized into result indicators (RIs), answering the question of what has been achieved so far, and performance indicators (PIs), indicating what needs to be done to increase performance (
Parmenter 2015). Adding the word “key” indicates the importance of those factors in achieving the defined goals, revealing the critical success factors. RIs capture the results of operational actions and show whether the organization is “travelling in the right direction at the right speed” (
Parmenter 2015), which is important for the governance of the organization. The difference between RIs and PIs can be seen in the definition provided by Alegre et al. (
2016), where PIs are efficiency and effectiveness measures for the delivery of services with respect to target values. Such values are benchmarks used for comparison (
Cable and Davis 2004) or as reference points of improved performance (
Malano et al. 2004) and the establishment of policies (
Walter et al. 2009). Thus, metrics like those presented by Alegre et al. (
2016), Danilenko et al. (
2014), Kanakoudis et al. (
2011), Bouziotas et al. (
2019), and others, serving trend monitoring (
Andersen and Fagerhaug 2002) and long-term benchmarking objectives (
Berg 2013), do not reveal the dynamics or inner characteristics of a system failing under stress.
An emerging concept related to the performance of water systems under stress is that of resilience. Being a relatively recent term in the water industry, it has received many definitions in the scholarly literature (
Francis and Bekera 2014). The variations are mostly subtle (
Butler et al. 2017), while a stress-testing-oriented approach regarding water system resilience is given by Makropoulos et al. (
2018), who define resilience as “the degree to which an urban water system continues to perform under progressively increasing disturbance.” As an expansion of stability (
Holling 1996), resilience is linked to the ability of a system to react to stress conditions (
Todini 2000) and reduce the magnitude or duration of disruptive events (
NIAC 2009), to retain a level of functionality. Several studies have tried to capture and indicate resilience against a failure (mainly in network design), as a generic inverse function of failure time (
Hashimoto et al. 1982;
Kjeldsen and Rosbjerg 2005), quantified via resilience profile graphing tools (
Makropoulos et al. 2018), through a demand satisfaction ratio (
Mehran et al. 2015;
Zhuang et al. 2012), available energy (
Creaco et al. 2016;
Todini 2000), or graph theory metrics (
Herrera et al. 2016), while resilience through operational and financial dimensions is proposed in the Risk Analysis and Management for Critical Asset Protection (RAMCAP) approach (
AWWA 2010). Although resilience is used in system design optimization, linked to recovery plans (
Chmielewski et al. 2016), and becoming increasingly recognized as essential to rethinking contemporary water CPSs (
Nikolopoulos et al. 2019b), it has been argued that no resilience measure proposed to date can adequately describe cascading failures (
Shin et al. 2018) or provide adequate context on a system’s complex integrity predicament during stress.
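For orientation, two of the simpler measures cited above can be sketched in Python: resilience as the inverse of the mean failure duration, in the spirit of Hashimoto et al. (1982), and a demand satisfaction ratio. The function names, the rule that a time step is satisfactory when supply meets demand, and the example series are assumptions made for illustration; the cited studies define their measures in more detail.

```python
def failure_event_durations(satisfactory):
    """Return the length, in time steps, of each consecutive run of failures."""
    durations = []
    run = 0
    for ok in satisfactory:
        if ok:
            if run:
                durations.append(run)
                run = 0
        else:
            run += 1
    if run:
        durations.append(run)
    return durations


def inverse_failure_time_resilience(satisfactory):
    """Resilience as 1 / mean failure duration; None if no failure occurs."""
    durations = failure_event_durations(satisfactory)
    if not durations:
        return None  # undefined when the system never fails
    mean_duration = sum(durations) / len(durations)
    return 1.0 / mean_duration


def demand_satisfaction_ratio(supplied, demanded):
    """Total volume supplied divided by total volume demanded."""
    total_demand = sum(demanded)
    if total_demand <= 0:
        raise ValueError("total demand must be positive")
    return sum(supplied) / total_demand


if __name__ == "__main__":
    # Hypothetical supply and demand per time step at a single node.
    demanded = [10, 10, 10, 10, 10, 10]
    supplied = [10, 6, 4, 10, 8, 10]
    # Assumed rule: a step is satisfactory when supply meets demand.
    state = [s >= d for s, d in zip(supplied, demanded)]
    print(failure_event_durations(state))
    print(inverse_failure_time_resilience(state))
    print(demand_satisfaction_ratio(supplied, demanded))
```

Both measures collapse a whole failure history into a single number, which is convenient for comparison but hides where a failure started and how it propagated, consistent with the limitation noted above.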
To address the need to summarize, interpret, and communicate data derived from stress testing, the following sections propose a quantification framework that is deployable through a purpose-built tool. It is designed specifically for water CIs under threat and is adjustable to any internal and external operating environment of a water utility.