Open access
Technical Papers
Nov 20, 2015

Quest for a New Solver for EPANET 2

Publication: Journal of Water Resources Planning and Management
Volume 142, Issue 3

Abstract

One of the best-known hydraulic water distribution modeling toolkits, used by both researchers and practitioners, is EPANET 2. Initially the authors aimed to speed up such simulations by utilizing modern multicore processors in the code implementation. The 30-year-old linear solver was replaced in steps by seven different modern multicore-capable solvers. Subsequently, speedup tests were carried out with small- (up to 1.6×10^3 nodes), medium- (up to 6.3×10^4 nodes), and large-sized (up to 6.3×10^5 nodes) test cases. None of the tested solvers was found to perform faster than the original solver for networks with a real-world character, although two solvers showed a speedup on medium and large water distribution networks. This is an example of how strategies to reduce computation time can produce promising results in a theoretical research environment but fail in practical engineering applications. Likewise, this paper highlights again the importance of considering realistic test cases during the implementation phase. Further, the authors point out why the different strategies tested in this work have not succeeded. Although two other issues (the implemented hash table and node reordering) were identified as potential improvements to the code, it was concluded that the original solver is still the fastest for practical system configurations.

Introduction

EPANET 2 (Rossman 2010) is the most widely applied software tool for simulating the hydraulics and water quality of water distribution networks. It was developed by the U.S. EPA and is, therefore, in the public domain. Accordingly, there is a very vibrant community of scientists and practitioners using EPANET 2 for various purposes. The popularity of the software has also resulted in various extensions and developments.
The key element of water distribution network modeling is the solution of a large set of nonlinear equations. In terms of single-application runtime, EPANET 2 already has a quite efficient solver for this numerical problem. Large networks are typically solved in seconds to minutes at most. But because of the vast field that EPANET 2 is applied to, the simulation software is not just run for single instances. In research scenarios, like network optimizations (Basupi and Kapelan 2015), rehabilitation planning (Roshani and Filion 2015), long-term effect simulations (Sitzenfrei and Rauch 2015), and sensitivity and vulnerability analysis, the model is run multiple times, potentially causing significant waiting periods for users to obtain their results (Diao and Rauch 2013; Sitzenfrei et al. 2011, 2013a). Additionally, some of the recently introduced pressure-dependent solvers follow an iterative approach, further increasing the computational costs (Tanyimboh and Templeman 2010; Ang and Jowitt 2006; Siew and Tanyimboh 2011, 2012; Giustolisi et al. 2008; Rossman 2007).
Because of the aforementioned reasons, it is still worthwhile to research a faster solver for EPANET 2 [or enhanced (e.g., Alvarruiz et al. 2015) or even other solving algorithms (e.g., Belova et al. 2014)]. The solver that is currently implemented in EPANET 2 is approximately 30 years old (George et al. 1994). Since then, computers have changed radically and advances in numerical analysis have theoretically found faster and better methods for solving the set of equations. However, it is unclear how these solvers perform in real world applications or when used to solve the specific set of equations created by the Global Gradient Algorithm implemented in EPANET 2.
This paper describes the attempts to speed up EPANET 2 by using modern solver architecture. Special attention was paid to utilizing the additional parallel computing resources found in modern desktop computers. The current solver of EPANET 2 is not optimized for such parallel processors; therefore, a major part of the computing resources found in modern processors is left unused.
EPANET 2 is based on the solution of a set of nonlinear equations using Newton iteration. In each iteration step, a matrix factorization is performed. These matrix operations are known to perform very well in a parallel environment, and several packages exist that support parallel linear algebra (Van de Geijn 1997; von zur Gathen 1993; Agullo et al. 2009). A parallel implementation of EPANET 2 for multicore computers was thought to be a fast and straightforward task. To reduce the development and testing time and to increase confidence in the implementation, a "replace sequential with parallel solver" procedure (with very minor code changes) was aimed for.
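In schematic form, each Newton step solves a sparse linear system (a generic sketch; the exact coefficient definitions are those of the gradient method of Todini and Pilati 1988, discussed later):

\[
\mathbf{A}^{(k)}\,\mathbf{H}^{(k+1)} = \mathbf{b}^{(k)},
\]

where \(\mathbf{A}^{(k)}\) is an n×n sparse, symmetric, positive-definite matrix (n being the number of junctions), \(\mathbf{H}^{(k+1)}\) holds the unknown nodal heads, and \(\mathbf{b}^{(k)}\) collects the flow- and demand-dependent terms of iteration k. Because \(\mathbf{A}^{(k)}\) changes in every iteration, a new factorization (or an iterative solve) is required per Newton step; this is precisely the operation the parallel linear algebra packages cited above accelerate.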
To prevent others from falling into the same traps of such a seemingly easy task, this paper describes not only the successes but rather the whole quest, i.e., which solvers do not work for common applications of EPANET 2. Several earlier attempts already point toward a problem that is very hard to solve.
Alonso et al. (2000) used parallel computing in a distributed fashion for solving the matrix factorization and other unspecified parts of EPANET 2. They achieved a small speedup on a system with approximately twice as many pipes as nodes. On all other systems, however, no speedup was achieved. Furthermore, it was not specified which part of the parallelization contributed to the speedup.
Crous et al. (2012) used a conjugate gradient method to solve the linear equations on a graphics processing unit (GPU). Although the conjugate gradient method performs very well on GPUs (Bolz et al. 2003), Crous et al. (2012) implemented a solver for very large, dense matrices, which are not found in typical applications of EPANET 2; it rather describes a completely different problem with a different algorithmic complexity. Furthermore, the authors' investigations showed that equations resulting from water distribution problems are badly or ill-conditioned and, therefore, unsolvable with an unpreconditioned iterative method. Recently, Zecchin et al. (2012) showed that iterative solvers preconditioned with an incomplete lower-upper (ILU) factorization or an algebraic multigrid method (AMG) are able to solve large artificial networks faster than the current solver, but they failed to show similar results for problems that are typical when applying EPANET 2.
Guidolin et al. (2010, 2011, 2013) found that updating of the coefficients of the linear system costs more than solving the system itself. Therefore, Guidolin et al. (2013) focused on improving this part of the code. In their study, they used the CWSNet library, which is compatible with EPANET 2, but using a different code base, and they achieved a moderate speedup in the performance. However, this approach involves huge code changes in many places and functions of EPANET 2 and not the improvement of the sparse matrix solver as aimed for in this paper.
Another method that has already been applied for solving the problem is domain decomposition (Diao et al. 2013). Whereas the method is particularly well suited for parallelization, information is needed from the underlying graph of sparse linear systems (Saad 2003). Employing such a solver typically involves significant code changes and conflicts with the idea of encouraging trust in the implementation by making only minimal alterations. In the work of Diao et al. (2013), the applied solver showed a speedup of eight on one tested system. Because the aim of this paper is to obtain a practical and easily usable solution for the main audience of EPANET 2 (practitioners and researchers), a solver on the basis of distributed computing parallelization [as used in Diao et al. (2013)] is, therefore, out of the scope of this work (Gaidamour et al. 2008).
To summarize, the starting point of this investigation is the enhancement of one of the most widely applied software tools in urban water management, EPANET 2. Significant attempts to improve the 30-year-old solution procedures have been described in the literature, but none of them showed overwhelming success. In this paper, the authors give an overview of this issue through rigorous code and performance analysis, in particular with a view to the latest computer architectures, and describe the successes but also the pitfalls of this quest for a new solver for EPANET 2. With code profiling, the most promising function for runtime optimization is identified, and the 30-year-old linear solver is replaced in steps by seven different modern multicore-capable solvers. None of the tested solvers is found to perform faster than the original solver for networks with a real-world character, although two solvers show a speedup on medium and large water distribution networks. Further, it is found that the node reordering algorithm in EPANET 2 is highly inefficient for large systems and that an improved implementation could boost the performance on medium to large systems dramatically.

Methods

The research behind this paper went through many steps paved with failures and setbacks. The authors chose not only to describe the successful results but to tell the full story of their efforts, thus saving other researchers from wasting precious time on repeated investigations of dead ends. Fig. 1 is the map that should guide the reader through this research. The two large boxes represent the "Methods" and "Results" sections of this paper, respectively. The small boxes represent the "Profiling" subsection and the three performed tests. All three tests have a dedicated subsection in "Methods" (for the test description) and in "Results" (for the results of the test). To understand the reasons behind the tests and their order in time, it is sometimes necessary to refer to both "Methods" and "Results." The following paragraph gives a brief overview of the temporal progress of the investigation and references the sections with detailed information.
Fig. 1. How to read this paper
The first step in the research was to profile EPANET 2 (see "Methods: Profiling"). Profiling the application led to the conclusion that the solver used in EPANET 2 consumes the greatest share of the runtime. The section "Methods: Solver Replacement" describes the first attempt to enhance the performance of EPANET 2 by replacing the old solver with a parallel solver. The first test revealed that a new parallel solver does not make EPANET 2 faster (see "Results: Solver Replacement"). It was then assumed that the tested systems were too small for the parallel solver. To test this hypothesis, a second test (described in "Methods: Feasibility of a Parallel Solver") was conducted to show at what sizes a parallel solver would be faster. The second test revealed that parallel solvers are only faster in system configurations that are unrealistic and unphysical (see "Results: Feasibility of a Parallel Solver"). This raised the suspicion that the very carefully selected solver simply does not work well for this type of problem. At the same time, results from the work of Zecchin et al. (2012) corroborated this suspicion, as they showed that a certain type of solver is capable of speeding up only certain types of networks (namely large ones). Therefore, a third test was conducted in which several solvers were benchmarked (see "Methods: Rigorous Solver Benchmark"). The same artificial networks as in Zecchin et al. (2012) were tested, but additionally also networks with real-world characteristics. The third test revealed that the simplest solver, one that is quite similar to the original solver, is the fastest for small artificial networks; a multicore parallel solver is the fastest for medium-sized networks; and for very large artificial systems, the solver suggested in Zecchin et al. (2012) is the fastest. Nevertheless, for all networks with real-world characteristics (including small networks), the simple solver is the fastest (see "Results: Rigorous Solver Benchmark").

Profiling

The underlying aim of this work is to make EPANET 2 faster. In the process of optimizing an application, the first step is always to find the algorithm (or part of one) that is the most computationally intensive, i.e., the part of the code that is slowest in terms of runtime. A profiler is used to give a graphical representation of the functions that were called and how much time was spent during their execution. Fig. 2 shows part of a typical profile of an extended-period simulation (large real-world network, 72 h simulation period, 1 h time step) of EPANET 2. Program profiles make clear where most of the application's time is spent. Each box represents a function in the program. The first line in the box is the name of the function, followed by the cumulative time spent in both the function and all subsequent function calls. The second line gives just the time spent in the function itself. An arrow from one box to another represents a function call (e.g., from netsolve, the functions linkstatus, linsolve, newcoeffs, and newflows are called). Lines ending with an "×" state how often a call was executed.
Fig. 2. Original EPANET 2 profile
The two most time-intensive functions for the specific tested network are linsolve and newcoeffs, which are both called from the netsolve function. The newcoeffs function updates the coefficients corresponding to the gradient method (Todini and Pilati 1988) and fills them into the matrix; linsolve solves the linear equation system previously filled by newcoeffs. Although newcoeffs uses 31% of the complete time, that runtime is distributed throughout a whole range of functions and is, thus, more difficult to address. In contrast, 29% of the overall runtime is spent in linsolve, which makes it a perfect target for optimization, especially because it is a general-purpose linear algebra solver that has been implemented several times in commercial and open-source products. This initial profiling is performed with one single network as a starting point for the optimization procedure. Profiling different networks with varying characteristics could lead to divergent results; relying on a single network, therefore, carries the risk of not representing the typical execution time of the application. Nevertheless, using just one network is a common approach for profiling and, as a starting point for this work, it is assumed to be sufficient.

Solver Replacement

After profiling, the work focused on replacing linsolve with a call to a highly optimized solver, preferably running on a GPU. Parallel implementations of LAPACK and BLAS (both standard code packages for numerical computations) are common and achieve very good performance on GPUs and multicore CPUs (Van de Geijn 1997; Volkov and Demmel 2008; von zur Gathen 1993; Humphrey et al. 2010). The problem is that equations based on network topologies (i.e., graphs) result in sparse linear systems (Davis 2006). Replacing linsolve with a dense linear solver has already been done (Crous et al. 2008, 2012). But dense solvers perform many more floating-point operations than sparse solvers, especially for the very sparse systems that arise from water distribution modeling. Equations resulting from graph-based problems have mostly zero entries; skipping these zeros has storage and algorithmic advantages, because one neither needs to store them in memory nor perform calculations on them.
In contrast to dense matrices, which are stored in either row- or column-major format, sparse matrices must record nonzero elements in a special format. Almost all packages have converged on the compressed sparse row (CSR) or column (CSC) format for storing sparse matrices (Davis 2006). EPANET 2 uses a slightly modified version of CSR/CSC in which the diagonal is represented as a separate dense array. This is possible because the underlying matrices should always be positive-definite, and positive-definite matrices always have nonzero diagonal elements [see George et al. (1994), p. 160, for more information on the exact format used in EPANET 2].
To keep true to the "replace the call to linsolve and do not touch anything else" strategy, the authors needed to convert from one sparse format to the other, solve the equations, and write the results back. The equations from EPANET 2 result in a symmetric positive-definite matrix. Some solvers take advantage of the symmetry and store only the upper or lower triangle of the matrix, something that must be taken into account in the conversion process.
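The following C sketch illustrates the two storage schemes and the conversion step (struct and function names are illustrative, not EPANET 2's actual identifiers; error handling is omitted):

    #include <stdlib.h>

    /* EPANET-2-style storage: off-diagonal nonzeros in CSR-like arrays plus
       the diagonal as a separate dense array (always nonzero for symmetric
       positive-definite matrices). */
    typedef struct {
        int     n;       /* matrix dimension (number of junctions)      */
        double *diag;    /* diag[i] = A[i][i], length n                 */
        double *offd;    /* off-diagonal nonzero values                 */
        int    *col;     /* column index of each off-diagonal nonzero   */
        int    *rowptr;  /* row i owns offd[rowptr[i] .. rowptr[i+1]-1] */
    } En2Matrix;

    /* Plain CSR, the format most solver libraries expect. */
    typedef struct {
        int n, nnz;
        double *val;
        int *col, *rowptr;
    } CsrMatrix;

    void en2_to_csr(const En2Matrix *a, CsrMatrix *c)
    {
        int i, j, k = 0;
        c->n      = a->n;
        c->nnz    = a->n + a->rowptr[a->n];     /* diagonal + off-diagonal */
        c->val    = malloc(c->nnz * sizeof *c->val);
        c->col    = malloc(c->nnz * sizeof *c->col);
        c->rowptr = malloc((c->n + 1) * sizeof *c->rowptr);
        for (i = 0; i < a->n; i++) {
            c->rowptr[i] = k;
            c->val[k] = a->diag[i];             /* merge the diagonal back */
            c->col[k++] = i;
            for (j = a->rowptr[i]; j < a->rowptr[i + 1]; j++) {
                c->val[k] = a->offd[j];
                c->col[k++] = a->col[j];
            }
            /* note: some libraries additionally require the column indices
               within each row to be sorted */
        }
        c->rowptr[c->n] = k;
    }

The right-hand side and the solution vector need no conversion at all, because both representations use plain dense arrays for them.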
For this first test, PARDISO was chosen as the replacement for linsolve. PARDISO is a parallel direct solver that uses standard parallelization frameworks [OpenMP and MPI (Dagum and Menon 1998; Snir et al. 1995)] (Schenk et al. 2000). A multicore solver was initially favored because it is easier to debug on the CPU than on a dedicated GPU, for which debugging methods are not as evolved. PARDISO was selected because it offers the best performance on a number of sparse problems (Gould et al. 2007) and is also used by Intel for its Math Kernel Library (MKL) (Intel Corporation 2013). The PARDISO solver is also used in the third test, in which a number of sparse solvers were benchmarked (see "Methods: Rigorous Solver Benchmark").
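For illustration, the shape of such a replacement call is sketched below against Intel MKL's PARDISO interface (a minimal sketch assuming a 0-based CSR matrix; error checks are trimmed, and a production version would keep the phase-11 analysis across Newton iterations because the sparsity structure does not change):

    #include <stdio.h>
    #include <mkl.h>

    /* Solve the SPD system A x = b once with MKL PARDISO. */
    void solve_with_pardiso(MKL_INT n, double *a, MKL_INT *ia, MKL_INT *ja,
                            double *b, double *x)
    {
        void   *pt[64] = {0};     /* opaque internal solver handle         */
        MKL_INT iparm[64], idum = 0;
        double  ddum = 0.0;
        MKL_INT mtype = 2;        /* real symmetric positive definite      */
        MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 0, error = 0;
        MKL_INT phase;

        pardisoinit(pt, &mtype, iparm);
        iparm[34] = 1;            /* 0-based (C-style) row/column indices  */

        phase = 11;               /* reordering and symbolic factorization */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
        phase = 22;               /* numerical factorization               */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
        phase = 33;               /* forward and backward substitution     */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja,
                &idum, &nrhs, iparm, &msglvl, b, x, &error);
        phase = -1;               /* release internal solver memory        */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, &ddum, ia, ja,
                &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);
        if (error) fprintf(stderr, "PARDISO returned error %d\n", (int)error);
    }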
To experiment with other solvers and analyze the matrices in numerical computing environments, such as MATLAB, the authors implemented a function to dump the matrices from the EPANET 2 software to an output file just before the call to linsolve. To ensure that all solvers are tested under correct conditions and with valid matrices, all benchmarks were implemented to use the dumped matrices from the modified EPANET 2. Results can, therefore, be directly projected to an actual implementation in EPANET 2. It also allows for fast, easy, and repeatable testing of various solvers. The binary representations of the floating-point numbers representing the matrix coefficients were written directly into a file. This ensures that the floating-point numbers, with exponent and mantissa, are bit-wise equivalent in both the original EPANET 2 software and in the benchmark application. The authors used the native floating-point format of the solvers, which is always the double type (8 bytes), as modern CPUs are optimized for doubles. Only the compute unified device architecture (CUDA) solver uses single floating-point precision, as GPUs have more single-precision floating-point units. The time spent writing and reading those files was not included in the performance tests.
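A minimal sketch of such a dump routine is shown below (the exact file layout used by the authors is not specified in the paper; this one is illustrative):

    #include <stdio.h>

    /* Dump the linear system right before the call to linsolve so external
       benchmarks see bit-identical coefficients. */
    int dump_system(const char *path, int n, int nnz,
                    const int *rowptr, const int *col,
                    const double *val, const double *rhs)
    {
        FILE *f = fopen(path, "wb");
        if (f == NULL)
            return -1;
        fwrite(&n,   sizeof n,   1, f);
        fwrite(&nnz, sizeof nnz, 1, f);
        fwrite(rowptr, sizeof *rowptr, (size_t)n + 1, f);
        fwrite(col,    sizeof *col,    nnz, f);
        fwrite(val,    sizeof *val,    nnz, f);   /* raw IEEE-754 doubles:
                                                     exponent and mantissa are
                                                     preserved bit for bit */
        fwrite(rhs,    sizeof *rhs,    n,   f);
        fclose(f);
        return 0;
    }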

Feasibility of a Parallel Solver

After the initial implementation of PARDISO in EPANET 2 failed to provide any speedup, it was assumed that the matrices resulting from water distribution networks are too small and too sparse. The additional work done by a parallel solver to distribute and synchronize parallel tasks is called the parallel overhead. This parallel overhead typically does not grow with the size of the input; therefore, the ratio between parallel overhead and actual workload decreases with increasing input size (e.g., larger water distribution networks with more neighbors, as shown in Fig. 3). If this overhead is too high, e.g., for very small and sparse systems, the parallel solver is slower than a sequential solver. Such behavior is equally found in other systems and is discussed in detail in Burger et al. (2014).
Fig. 3. Configurations with (a) four neighbors; (b) eight neighbors
To figure out at what problem sizes the PARDISO solver is faster than the sequential solver, a test with varying problem sizes was performed. For this test, random EPANET 2 input files were generated with an increasing degree of neighbors (to decrease the sparsity of the system) and an increasing number of nodes (to increase the system size). Decreasing the sparsity and increasing the system size lowers the ratio of parallel overhead to workload.
The graphs for the input systems were generated with a stochastic process. First, n nodes were randomly sampled on the unit square, (x, y) ∈ [0,1) × [0,1). Each node n_i was then connected to its k nearest neighbors. Ten percent of the nodes were defined as reservoirs to fulfill the boundary conditions. The nodal heads, demands, pipe lengths, and pipe roughness coefficients were also randomly sampled. For the test, 882 graphs (with n ranging from 400 to 20,000 in 200 steps, and k ranging from 2 to 10) were generated. For each graph, an EPANET 2 input file was generated, and the runtimes of both the modified EPANET 2 (with the original solver replaced by the PARDISO solver) and the original EPANET 2 were measured.
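The following C sketch reproduces the essence of this generator and writes a minimal EPANET 2 input file (the value ranges, the brute-force k-nearest-neighbor search, and the section contents are illustrative and simplified; sections such as [OPTIONS] and [COORDINATES] are omitted):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 400   /* number of nodes    */
    #define K 4     /* neighbors per node */

    static double urand(double lo, double hi)
    {
        return lo + (hi - lo) * rand() / ((double)RAND_MAX + 1.0);
    }

    int main(void)
    {
        static double x[N], y[N];
        int i, j, m, p, pipe = 1, nres = N / 10;   /* 10% reservoirs */
        FILE *f = fopen("random.inp", "w");

        srand(42);
        for (i = 0; i < N; i++) { x[i] = urand(0, 1); y[i] = urand(0, 1); }

        fprintf(f, "[RESERVOIRS]\n");              /* ;ID Head */
        for (i = 0; i < nres; i++)
            fprintf(f, " N%d %.1f\n", i, urand(100, 140));
        fprintf(f, "[JUNCTIONS]\n");               /* ;ID Elev Demand */
        for (i = nres; i < N; i++)
            fprintf(f, " N%d %.1f %.2f\n", i, urand(10, 20), urand(0, 10));

        fprintf(f, "[PIPES]\n");   /* ;ID Node1 Node2 Length Diam Roughness */
        for (i = 0; i < N; i++) {
            int picked[K];
            for (m = 0; m < K; m++) {              /* m-th nearest neighbor */
                int best = -1; double bestd = 1e30;
                for (j = 0; j < N; j++) {
                    double dx = x[i] - x[j], dy = y[i] - y[j];
                    double d = dx * dx + dy * dy;
                    int skip = (j == i);
                    for (p = 0; p < m; p++) skip |= (picked[p] == j);
                    if (!skip && d < bestd) { bestd = d; best = j; }
                }
                picked[m] = best;   /* duplicate i-j pairs are possible, as
                                       in any plain k-NN construction */
                fprintf(f, " P%d N%d N%d %.1f %.0f %.2f\n", pipe++, i, best,
                        urand(1, 100), urand(100, 200), urand(1, 4));
            }
        }
        fprintf(f, "[END]\n");
        fclose(f);
        return 0;
    }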

Rigorous Solver Benchmark

The third test was used to analyze the performance of various solvers on different ranges of networks. Two types of networks, artificially generated and real-world, were used to determine the fastest solver for EPANET 2. The artificial networks were used to provide a wide range of networks with different sizes. They follow the scheme of the networks generated in Zecchin et al. (2012) (see the "Artificial Networks" subsection). They are easy to generate but do not necessarily reflect the properties of real-world networks. That is why a second set of real-world networks was included in the tests (see the "Real-World Networks" subsection). To cover a wide range of different solvers, three iterative and four direct solvers were chosen: five parallel (CPU and GPU) and two sequential (see the "Solvers" subsection).

Artificial Networks

The artificial input systems were generated following the work in Zecchin et al. (2012). The nodes of the system are placed on an orthogonal grid. The number of nodes in the system is driven by the side length of the square; the side lengths were chosen such that the resulting node counts equal 1×10^a, where a ranges from 2 to 6 in 0.2 steps. For the links, configurations with four and eight neighbors were generated, as shown in Figs. 3(a and b), respectively. The eight-neighbor configuration was included to decrease the sparsity of the resulting matrix, which should favor parallel solvers.
The resulting graph is then converted to the EPANET 2 input file format; the nodes represent junctions and the links represent pipes. Ten percent of the junctions are randomly chosen to be reservoirs. The values for the elements were also chosen randomly. For the junctions, the elevation was chosen between 10 and 20 m, and the nodal demand varied between 0 and 10 L/s. For the pipes, the length was set between 1 and 100 m, the diameter between 100 and 200 mm, and the roughness between 1.0 and 4.0 mm (Darcy-Weisbach); the option "no minor losses" was set and all pipes were open. For the reservoirs, the elevation was defined between 100 and 140 m. The EPANET 2 simulations for the resulting systems did not become unbalanced, produced no warnings, and showed no negative pressures.

Real-World Networks

As described in the “Artificial Networks” section, the squared networks allow generating a wide range of artificial networks with different problem sizes. But these squared networks do not necessarily reflect system configurations in real-world networks. For example, the networks described in the previous section only included junctions, pipes, and reservoirs. Regulating elements such as valves, pumps, rules, or emitters are missing completely.
The real-world networks described in this paper differ in many ways from the generated squared networks described in the previous sections. The first difference is that they fall into the category of small systems, with node counts of up to 10^4. Real-world systems fall into this category because they are typically built manually or generated from geographical information system (GIS) data, which are in most cases also created manually. The number of nodes dictates the dimension (n) of the resulting matrix.
The second difference is that the artificially generated systems have four or eight neighbors, i.e., each node is connected with four or eight pipes. Water network infrastructures with such high neighbor counts are unrealistic because of their high costs and redundancy. The number of neighbors influences the number of nonzero elements (nnz) in the matrix. For the four- and eight-neighbor configurations, the ratio between nnz and n (fill ratio) is 3.0 and 5.0, respectively, whereas the ratio for real-world systems is approximately 2.0.
The size and fill ratio are parameters that influence the runtime of all solvers, direct and iterative ones. One parameter that influences only the iterative solvers is the condition number of a linear system (Ax = b). The condition number κ(A) indicates how well a system can be solved by iterative methods. An identity matrix has condition number one; it can, therefore, be solved very quickly (one iteration) with errors only at the level of machine accuracy. The higher the condition number, the slower the CG (conjugate gradient) method converges, if it converges at all (Saad 2003).
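For reference, the standard definitions are (Saad 2003)

\[
\kappa(A) = \|A\|_2\,\|A^{-1}\|_2 = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)} \ge 1,
\qquad
\|x_k - x^*\|_A \le 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^{k}\|x_0 - x^*\|_A,
\]

where the second expression is the classical error bound for CG applied to a symmetric positive-definite system; the number of iterations needed for a fixed accuracy thus grows roughly with the square root of the condition number.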
Table 1 summarizes the sizes and system elements of the 13 real-world networks used for benchmarking the solvers in this section. The networks 01-towng2, 04-towng1, 05-townh, and 10-town are actual networks from urban areas in Austria. The networks 02-bwsn1 and 12-bwsn2 are from "The Battle of the Water Sensor Networks," examples 1 and 2 (Ostfeld et al. 2008). The 03-dtown network is from "The Battle of the Water Networks II" (Marchi et al. 2014). The 06-richmond network is the water distribution network of Richmond in the U.K. (Van Zyl et al. 2004). The 07-wolf network is from the Wolf-Cordera Ranch model (Lippai 2005). The 08-exnet network is the EXNET network from the University of Exeter (Farmani et al. 2004). The network 09-run1911 was generated using Sitzenfrei et al. (2013b) and Möderl et al. (2011). The networks 11-vibk and 13-vtirol are virtual networks, generated using Mair et al. (2014), that represent the actual networks of the city of Innsbruck in Austria and of the whole county of Tyrol in the west of Austria, respectively. As shown, not all of these networks are real-world case studies, but because all of them represent real-world network characteristics, they are listed in this section.
Table 1. Real-World System Information
Name | Junctions | Pipes | Pumps | Reservoirs | Tanks | Valves | Emitters
01-towng2 | 127 | 157 | 6 | 2 | 4 | 1 | 0
02-bwsn1 | 126 | 168 | 2 | 1 | 2 | 8 | 0
03-dtown | 399 | 443 | 11 | 1 | 7 | 5 | 0
04-towng1 | 453 | 603 | 5 | 6 | 11 | 10 | 0
05-townh | 798 | 881 | 1 | 3 | 3 | 16 | 0
06-richmond | 865 | 949 | 7 | 1 | 6 | 1 | 0
07-wolf | 1,782 | 1,985 | 6 | 0 | 4 | 4 | 0
08-exnet | 1,891 | 2,465 | 0 | 2 | 0 | 2 | 0
09-run1911 | 5,118 | 6,769 | 8 | 12 | 17 | 0 | 0
10-town | 7,128 | 7,599 | 0 | 0 | 4 | 3 | 0
11-vibk | 12,377 | 13,025 | 0 | 2 | 0 | 0 | 0
12-bwsn2 | 12,523 | 14,822 | 4 | 2 | 2 | 5 | 0
13-vtirol | 82,079 | 84,289 | 0 | 2 | 0 | 0 | 0
Table 2 gives a summary of the solvers. Name is the shortened identifier used in this paper. Product is the name under which the solver is published. Method is a short description of how the solver solves the linear system. Parallel describes whether the solver runs in parallel on the CPU, on the GPU, or not at all. All solvers are written in native languages (C/Fortran) and use OpenMP on the CPU to parallelize the algorithms.
Table 2. Summary of the Solvers
Name | Product | Algorithm | Method | Parallel
Pard | PARDISO | Left- and right-looking supernode technique | Direct | CPU
Hsl | HSL-MA87 | DAG-based | Direct | CPU
Chol | CHOLMOD | Left-looking supernodal | Direct | None
En2 | CXSparse | Up-looking | Direct | None
Samg | SAMG | AMG-CG | Iterative | CPU
Vcl | ViennaCL | ILU0-CG | Iterative | CPU
Cuda | CUDA | ILU0-CG | Iterative | GPU

Solvers

The pard solver (from now on, all solvers are identified by the acronyms shown in Table 2), also used in the first and second tests, was selected as the primary CPU-based parallel direct solver. The solver uses a combination of left- and right-looking supernode techniques (Schenk et al. 2000; Schenk 2013). To improve scalability, it exploits pipeline parallelism and dynamic two-level scheduling (Schenk and Gärtner 2002).
The hsl solver is from the HSL (formerly Harwell Subroutine Library) software package. The package MA87 provides a direct solver for sparse symmetric positive-definite systems using OpenMP. In contrast to all other solvers, this solver builds a directed acyclic graph (DAG) for handling the dependencies between parallel tasks. Each task works on a dense block of the factor L, which is factorized using a traditional dense Cholesky decomposition; a task may only start once all blocks it depends on have already been calculated (Hogg et al. 2010). Additionally, the package MC68 is used for approximate minimum-degree reordering (Science and Technology Facilities Council 2013).
The chol solver uses a dynamic supernodal left-looking sparse Cholesky method (Davis and Hager 2009). The chol solver was selected as a high-performance sequential solver. The chol package also provides a GPU solver based on cuBLAS, but this feature was not used in this study (Rennich et al. 2014). CHOLMOD has an internal heuristic, based on the size and density of the matrix, which chooses the solving strategy (simplicial or supernodal). On the tested matrices, the supernodal factorization was never triggered because the matrices are too small (Davies 2015).
The en2 solver is a very simple up-looking sparse Cholesky solver from the CXSparse package as described in Davis (2006). The original functionality of reordering and sparse storage allocation leading up to the call to linsolve is not extractable from EPANET 2. The original Fortran methods (GSFCT and GSSLV) from the SPARSPAK package (Chu et al. 1984), described in the book by George et al. (1994), are obsolete and have been removed from the netlib repository (Browne et al. 1995). Nevertheless, this factorization can be considered the simplest one and can, therefore, be seen as the most similar to the code found in EPANET 2. Furthermore, the reordering code from EPANET 2 shows a very high asymptotic complexity, which makes the routine unusable for large system sizes [see "Other Observations" and Zecchin et al. (2012)]; the authors, therefore, omitted further benchmark tests of the original EPANET code. This nonlinear growth in runtime was not observed with the code from Davis (2006).
As previously stated, iterative methods cannot solve the very ill-conditioned problems arising from water distribution simulations. Therefore, all iterative methods described in this paper use a preconditioner to improve the condition of the system; these solvers are thus classified as preconditioned conjugate gradient (PCG) methods. The first iterative solver is samg, identical to the one described in Zecchin et al. (2012). It uses an AMG preconditioner with the CG method as accelerator. This solver also uses OpenMP for parallelization; the special first-touch policy suggested in the manual was implemented.
The second and third iterative solvers are standard PCG with an incomplete LU zero (ILU0) factorization as preconditioner (Saad 2003). The vcl solver (from the ViennaCL project) was used as an example of a parallel ILU0-PCG on the CPU (Rupp et al. 2010). In this research, only the CPU version was used because (1) the solver failed to find solutions for several systems, and (2) the GPU version of vcl is similar to the CUDA cuBLAS and cuSPARSE libraries. The last solver is an ILU0-PCG on the GPU, using the cuSPARSE and cuBLAS libraries. The implementation follows Naumov (2011); the details behind the GPU version of an ILU0 preconditioner can be found in Li and Saad (2013).
None of the authors of this paper is associated with the development of, or research on, any of the tested solvers.

Testing Procedure

For both types of networks, the same procedure was carried out to evaluate the runtime of the solvers. The first step was to dump a matrix representation of the given input file, using the previously described modified EPANET 2. Then, the dumped matrix was used to run the benchmark application with all seven solvers. To emulate the outer Newton iteration of EPANET 2, the system is solved eight times, which is usually enough to reach a satisfactory result in the iterative procedure. Everything that can be done outside the Newton iteration (e.g., reordering, symbolic factorization) was included only once in the total runtime of the solver. To compensate for fluctuations in computing time, small systems were solved multiple times and the results were averaged over the number of runs.
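In outline, the timing harness looks as follows (a sketch: solver_setup and solver_solve stand in for each library's actual setup and solve calls):

    #include <time.h>

    /* Placeholders for the per-library API used in the benchmark. */
    void solver_setup(void *matrix);            /* reordering, symbolic phase */
    void solver_solve(void *matrix, void *rhs); /* factorize + substitute     */

    /* Time one dumped system: setup once, then eight solves to emulate the
       outer Newton iteration; average over several runs to smooth noise. */
    double benchmark(void *matrix, void *rhs, int runs)
    {
        struct timespec t0, t1;
        double total = 0.0;
        for (int r = 0; r < runs; r++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            solver_setup(matrix);
            for (int it = 0; it < 8; it++)
                solver_solve(matrix, rhs);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            total += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        }
        return total / runs;
    }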
All tests were performed on a dual-socket system with Intel® Xeon® X5650 CPUs at 2.67 GHz. Each CPU has six physical cores, which results in 12 physical and 24 hyper-threaded cores; the tests were run with 12 threads. The system has 24 GB of RAM and two NVIDIA Tesla C2070 cards, each featuring 6 GB of GDDR5 RAM and 448 stream processors. The GPU solver was executed on only one GPU. The system ran Linux with kernel version 3.10.21.

Results

Solver Replacement

If the PARDISO solver, used in the first test to replace the linsolve code, had delivered extreme performance advantages, the paper would have ended here, stating the obligatory speedup numbers. This was not the case. With the conversion in place, the first implementation did not gain any speed advantage over the standard EPANET 2 solver.
Converting from one sparse format to another is an overhead that could prevent the new implementation from being faster. To eliminate the conversion overhead, EPANET 2 was modified to use the standard CSR format directly, as it is used by most popular solvers, including PARDISO. This move diverged from the initial idea of modifying only the call to the linsolve function, but it was more or less a find-and-replace procedure that in itself makes no difference to the performance of EPANET 2. But again, benchmarks with the CSR format in place did not reveal any speed advantage over the standard EPANET 2.
This indicates that the conversion overhead between different sparse formats is not the limiting factor; the problem is somewhat more complex. It was believed that linear systems from typically sized water distribution networks are simply too small and too sparse for the PARDISO solver. Therefore, a second test was conducted.

Feasibility of a Parallel Solver

This test shows at what number of nodes and number of connections per node the PARDISO solver is faster than the sequential one, in other words, at what matrix dimension and sparsity a parallel solver is feasible. In Fig. 4, k is the number of connections per node (increasing k decreases the sparsity of the resulting matrix) and n is the number of nodes in the system (increasing n increases the dimension of the matrix). At n = 2,500 and k = 6, the parallel solver is faster than the sequential solver. At approximately n = 7,000, the intersection curve stagnates near k = 4. This indicates that only with a connectivity degree of at least four is the parallel solver faster than the sequential one. Networks with a connectivity degree of at least four (i.e., four pipes connected at each node) are hardly seen in real-world systems. Even on a two-dimensional lattice, the boundary nodes have only three neighbors and the corner nodes, even worse, only two. Therefore, the authors must conclude that there is nothing to gain from using PARDISO, one of the fastest parallel direct sparse solvers, in EPANET 2 for real systems.
Fig. 4. (a) Intersection plot of parallel versus sequential solver showing when the parallel solver is faster than the sequential solver; (b) the two runtime plots overlaid: light gray is the parallel solver and dark gray is the sequential solver

Rigorous Solver Benchmark

This test shows a head-to-head performance evaluation of seven sparse solvers. The authors have chosen three iterative solvers and four direct solvers. A detailed description of each solver can be found in “Methods: Rigorous Solver Benchmark” followed by a summary of the solvers in Table 2.

Artificial Networks

Fig. 5 shows the results for the artificial networks in the four-neighbor configuration. Up to 3.9×10^3 nodes, the en2 solver is the fastest. After that, pard takes the lead up to system sizes of 1.6×10^5. Beyond that, samg is faster than all others. The authors can conclude that, for this test, en2 is the fastest for small systems, pard should be used for medium-sized systems, and samg for large systems.
Fig. 5. Runtimes from the artificial network with a four-neighbor configuration and 12 threads (4NB-12TH)
For small systems, pard has a strongly fluctuating runtime. This behavior can only be seen at high thread counts and vanishes as the thread count approaches one. Starting too many threads in pard for small systems is, therefore, counterproductive for the runtime. The hsl solver shows performance similar to pard but uses a directed acyclic graph for selecting parallel tasks. This task-based selection seems to favor the parallelization of small systems.
For the smallest systems, samg is the slowest solver, but it is the fastest for the largest ones. It seems that the setup costs of an AMG solver are very high but pay off especially for large systems. For en2, the opposite is true, as it is the fastest solver for small systems but the slowest one for large systems.
The cuda solver has a very flat and stable runtime curve and seems to be the one with the smallest gradient. This suggests that the cuda solver could be the fastest solver for systems even larger than the ones included in this research. In contrast, the overhead of the ILU0 calculation and of the memory transfer between the GPU and the main memory (which is not included in the performance tests) makes this solver a pointless choice for all problems of practical relevance. The second ILU0-PCG solver (vcl) performs just in between the other solvers, never fast and never slow.
Fig. 6 shows the runtime with an eight-neighbor configuration. This configuration decreases the sparsity of the matrices. The decreased sparsity and, therefore, decreased ratio between parallel overhead and factorization work was thought to favor the parallel solvers.
Fig. 6. Runtimes from the artificial network with an eight-neighbor configuration and 12 threads (8NB-12TH)
The results are the same as before: en2 is the fastest for small systems, pard for medium-range systems, and samg for large systems. However, the limits between small, medium, and large systems have shifted: en2 is the fastest for systems below 1.6×10^3 nodes, pard for systems up to 6.3×10^4, and samg for the larger systems. Thus, the eight-neighbor configuration does indeed favor the parallel solvers.
Fig. 7 shows the speedup of the parallel solvers. The purpose of this figure is to show how much can be gained by running a solver in a parallel context compared with the same solver in a sequential context. In the sequential context, the parallel solvers were forced to use only one CPU core. Speedup, in this context, is defined as the runtime in the sequential context (the solver operates with only one thread) divided by the runtime in the parallel context (the solver uses 12 threads). The horizontal black line marks a speedup factor of one, i.e., the solver in a parallel context is exactly as fast as in the sequential context. Values below one indicate that the solver in the parallel context is actually slower than in the sequential context, and vice versa.
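Formally, with T_1 the runtime on one thread and T_p the runtime on p threads, the reported quantities are

\[
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p},
\]

where S_p is the speedup plotted in Fig. 7 and E_p is the parallel efficiency referred to below (e.g., S_12 = 6.81 on 12 threads gives E_12 = 6.81/12 ≈ 0.57).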
Fig. 7. Parallel speedup (one versus 12 threads) for the artificial networks with an eight-neighbor configuration
Except for some rare conditions, parallelization always introduces an overhead. This overhead is the result of managing and synchronizing the work that needs to be done in parallel. With an increasing problem size, the parallel overhead typically decreases because more work can be done in parallel tasks before they must be managed and synchronized. Fig. 7 shows exactly at what problem sizes it is feasible to use the parallel versions of the solvers.
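A textbook way to see this cap is Amdahl's law (not invoked explicitly by the authors, but standard context for Fig. 7): if only a fraction f of the total work can be parallelized over p threads, the speedup is bounded by

\[
S_p \le \frac{1}{(1-f) + f/p},
\]

so the overhead and any sequential fraction limit the achievable speedup regardless of p; larger systems shift work into the parallelizable factorization, pushing f toward 1 and raising the bound.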
All solvers except hsl start to become faster at approximate system sizes of 10^3; hsl overcomes the parallel overhead at system sizes larger than 1.6×10^2. Although pard has the worst speedup for small systems, it has the best speedup (6.81) for the largest systems. However, considering that a 12-core system was used, a speedup of 6.81 reflects an efficiency of only 0.57.

Real-World Networks

The real-world networks tested in this research have very high condition numbers compared with the generated squared networks. The squared networks, depending on their size, have condition numbers between 1.5×10^3 and 2.8×10^4. The condition numbers of the real-world networks, listed in Table 3, range between 8×10^6 and approximately 9×10^21.
Table 3. Condition Number Estimations of the Real-World Networks
Name | Condition
01-towng2 | 1.36×10^11
02-bwsn1 | 1.23×10^10
03-dtown | 1.24×10^11
04-towng1 | –
05-townh | –
06-richmond | –
07-wolf | 8.33×10^10
08-exnet | 2.88×10^11
09-run1911 | 7.96×10^6
10-town | 9.14×10^21
11-vibk | 8.81×10^9
12-bwsn2 | –
13-vtirol | 9.83×10^9
All condition numbers were estimated using the condest command in MATLAB. High condition numbers can cause an increasing number of iterations or even stagnation (i.e., failure) of the iteration process. Accordingly, the ILU0-PCG iterative methods had substantial problems solving the systems with high condition numbers. Table 4 summarizes the systems for which the ILU0-PCG solvers failed to find a result. Combining Tables 3 and 4, it becomes clear that a water distribution network with size n ≥ 800 and condition number κ(A) ≥ 10^9 cannot be solved using the ILU0-PCG method. The cuda solver failed even earlier because it uses single-precision floating-point numbers, a limitation of current GPU computing. Although GPUs that perform double-precision floating-point operations are available, such devices are still rare, as double precision is not used in typical gaming applications. The third iterative solver (samg) was able to solve all systems, although the solver issued a warning.
Table 4. Real-World Networks for Which the Vcl and Cuda Solver Failed to Find Results
Name | Vcl | Cuda
03-dtown |  | x
04-towng1 |  | x
05-townh |  | x
06-richmond | x | x
10-town | x | x
11-vibk | x | x
12-bwsn2 | x | x
13-vtirol | x | x

Note: x is a marker indicating which solver/network combinations failed to be solved.

Fig. 8 shows the results of the real-world networks computed with 12 threads. Under these conditions, en2 and chol are the fastest solvers. Compared with the direct solvers, the iterative solvers show poor performance on these small systems. The pard solver again shows strongly fluctuating results for small systems but stabilizes at system sizes of approximately 10^3. Although samg is the only solver capable of solving all real-world systems, the overhead of the AMG method is too large for such small systems. For practical applications (real-world systems with high condition numbers and small node and neighbor counts), the en2 solver is the fastest for all investigated system sizes.
Fig. 8. Real-world networks with 12 threads
Fig. 9 shows the speedup of the parallel solvers for the real-world networks. With these small systems, hsl has the highest speedup, at approximately 2.8. All other solvers have a speedup below 2.0, which is the theoretical best speedup for the simplest parallel system configuration: a system with two cores. The authors can, therefore, conclude that there is not much to gain from a parallel sparse solver for real-world networks.
Fig. 9. Parallel speedup (one versus 12 threads) of real-world networks

Other Observations

While benchmarking the different solvers, some defects of EPANET 2 were found. The first and most severe one is that the runtime of the classical EPANET 2 grows in the order of approximately O(n^3) for the tested systems. This was estimated by timing several EPANET 2 runs and fitting the resulting curve to the formula t = a + b·n^c, where t is the observed runtime; n is the number of nodes; and a, b, and c are free parameters. The fit resulted in a = 4.581, b = 2.088, and c = 2.23. EPANET 2 was run with the generated squared input systems (of the "Results: Rigorous Solver Benchmark" section) with side lengths from 25 to 158, which results in node counts from 625 to 24,964. Fig. 10 displays the observed runtimes and the curve approximated by the fit. This graph shows how severely the runtime degenerates for systems larger than 10^4 nodes. It also shows that for smaller systems, i.e., in the range of real-world systems, the cubic complexity of EPANET 2 does not influence the overall runtime significantly.
Fig. 10. Approximated algorithmic complexity of the original EPANET 2 code
Fig. 11 shows the profile of a single execution run with a squared input of side length 158. With a system of this size, reordernodes takes around 93% of the whole EPANET 2 runtime, whereas linsolve was called eleven times during the execution but takes only 6.5% of the total runtime. This example proves that the node reordering algorithm in EPANET 2 is highly inefficient for large systems. Even though EPANET 2 implements the minimum-degree ordering after George et al. (1994), which has a complexity of O(n^2·m), with m being the number of edges per node, the observed runtime suggests that this implementation of the algorithm is highly unsuitable for large networks.
Fig. 11. Profile for the original EPANET 2 code for the artificial square input side length of 158
The second observation while working with large input systems was that, at a certain system size, EPANET 2 takes a very long time to parse the input file. The cause of this suboptimal performance was found to be a weak hashtable implementation. In the first phase of input parsing, EPANET 2 uses the names of the links and nodes to connect them into a graph. For fast lookup of names to their corresponding link or node data, a hashtable is used. The complexity of a hashtable is the complexity of the hashing method, typically O(1), plus the complexity of collision resolution, which is, in this case, chaining with a linked list and, therefore, O(n). In the case of no collisions, hashtables have an optimal complexity of O(1). When a hashtable gets full, collisions become frequent and the worst-case complexity of O(n) is approached (Leiserson et al. 2001). In this case, the hashtable must be resized and rehashed so that fewer or no collisions happen. This is not done by the hashtable in EPANET 2, which stays at a fixed size of 1,999 buckets. When EPANET 2 was first implemented, a bucket count of 1,999 was a good trade-off between memory size and algorithmic complexity. With node counts exceeding 10^4, a bucket count of 1,999 results in very slow lookup times during the parsing of the input files.
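For comparison, a hashtable that rehashes when its load factor exceeds 1 keeps lookups near O(1); the following C sketch (names are illustrative, not EPANET 2's actual hash code) shows the missing growth step:

    #include <stdlib.h>

    /* Chained hashtable that doubles its bucket array when the load factor
       exceeds 1.0; EPANET 2's table instead stays fixed at 1,999 buckets,
       so chains grow to O(n/1999) on average. */
    typedef struct Entry { char *key; void *data; struct Entry *next; } Entry;
    typedef struct { Entry **bucket; size_t nbuckets, count; } Table;

    static size_t hash(const char *s, size_t nbuckets)
    {
        size_t h = 5381;                       /* djb2 string hash */
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % nbuckets;
    }

    static void rehash(Table *t)
    {
        size_t nb = t->nbuckets * 2 + 1, i;
        Entry **nw = calloc(nb, sizeof *nw);
        for (i = 0; i < t->nbuckets; i++)
            for (Entry *e = t->bucket[i], *nx; e != NULL; e = nx) {
                nx = e->next;                  /* save before relinking */
                size_t h = hash(e->key, nb);
                e->next = nw[h];
                nw[h] = e;
            }
        free(t->bucket);
        t->bucket = nw;
        t->nbuckets = nb;
    }

    void ht_insert(Table *t, char *key, void *data)
    {
        Entry *e;
        size_t h;
        if (++t->count > t->nbuckets)          /* load factor > 1: grow */
            rehash(t);
        h = hash(key, t->nbuckets);
        e = malloc(sizeof *e);
        e->key = key;
        e->data = data;
        e->next = t->bucket[h];
        t->bucket[h] = e;
    }

With a growth step like this, parse-time lookups stay near amortized O(1) even for systems the size of 13-vtirol in Table 1.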

Discussion and Conclusion

The aim of this paper was to describe the findings of a long journey searching for a faster solver for EPANET 2. The problem seemed to be easy, as a substantial improvement was expected from replacing 30-year-old code with the current state of mathematical knowledge and algorithm design while incorporating the latest advances in hardware architecture. The bad news is that for typical real-world network sizes, the authors could not identify a solver that produces results significantly faster than the current one. The good news is that nobody has wasted any time by still using the old solver (apart from researchers trying to speed up the software).
In this work, rigorous testing of different modern solvers with numerous test cases of different sizes was performed. To this end, real-world case studies and artificially generated systems were used. This comprehensive test environment enabled the authors to show that the performance of the solvers is significantly influenced by the system size and the system characteristics. Thus, reported benefits of particular solutions proved to be strongly related to the test environment and not achievable in the tested real-world applications.
For very large generated systems, the same results were found as reported by Zecchin et al. (2012). The solver proposed there (samg) is 13 times faster than the en2 solver for the largest tested system. However, the same solver is 508 times slower for the smallest networks, which does not make it a perfect overall choice. Again, for medium-sized generated systems, yet another solver (pard) was found to perform best.
The picture is even worse for real-world systems, which are sparser than the artificial systems. For the largest real-world system, the original solver (en2) is 10 times faster than the best-performing alternative (samg) and 533 times faster for the smallest system. Likewise, the solver that was fastest for medium-sized generated networks (pard) is consistently slower than the original solver for real-world networks. Therefore, for practical applications (real-world systems with high condition numbers, small node counts, and small neighbor counts), it can be concluded that the en2 solver is the fastest for all investigated system sizes. None of the other tested solvers can be recommended for implementation in EPANET 2, as they perform suboptimally on the tested real-world problem sets.
The reason is that matrices originating from real-world water distribution models based on the gradient method [e.g., the global gradient algorithm (GGA) in EPANET 2 (Todini and Pilati 1988)] are too small and too sparse for any of the tested direct solvers to outperform the original solver. At the same time, these matrices are too ill-conditioned to be solved by standard iterative solvers; only one out of the three iterative solvers was able to solve all tested real-world systems.
This raises the question of whether any solver, direct or iterative, will ever be faster than the original solver in EPANET 2. These results correspond with findings by Wu and Lee (2011). If further progress toward a faster solution of the set of nonlinear equations is dubious, research should rather focus on other strategies to speed up such simulations. What can be recommended is to fix the two defects of the current code found in the course of this research: better hashtable and node reordering implementations do not harm the performance on any system but will improve the performance on medium to large systems dramatically.

Supplemental Data

Tables S1–S3 are available online in the ASCE Library (www.ascelibrary.org).

Supplemental Materials

File (supplemental_data_wr.1943-5452.0000596_burger.pdf)

Acknowledgments

This work is part of the DynaVIBe project, funded by the Austrian Science Fund (FWF); P23250-N24.

References

Agullo, E., et al. (2009). “Numerical linear algebra on emerging architectures: The plasma and magma projects.” Journal of Physics: Conf. Series, 180, IOP Publishing, Bristol, England, 012037.
Alonso, J. M., et al. (2000). “Parallel computing in water network analysis and leakage minimization.” J. Water Resour. Plann. Manage., 251–260.
Alvarruiz, F., Martínez-Alzamora, F., and Vidal, A. (2015). “Improving the efficiency of the loop method for the simulation of water distribution systems.” J. Water Resour. Plann. Manage., 04015019.
Ang, W. K., and Jowitt, P. W. (2006). “Solution for water distribution systems under pressure-deficient conditions.” J. Water Resour. Plann. Manage., 175–182.
Basupi, I., and Kapelan, Z. (2015). “Flexible water distribution system design under future demand uncertainty.” J. Water Resour. Plann. Manage., 04014067.
Belova, O., Skibin, A., and Volkov, V. (2014). “Control-volume method for extra large network hydraulic analysis.” Procedia Eng., 70(0), 123–131.
Bolz, J., Farmer, I., Grinspun, E., and Schröder, P. (2003). "Sparse matrix solvers on the GPU: Conjugate gradients and multigrid." ACM Trans. Graphics, 22(3), 917–924.
Browne, S., Dongarra, J., Grosse, E., and Rowan, T. (1995). The netlib mathematical software repository, D-Lib Magazine, Reston, VA.
Burger, G., Sitzenfrei, R., Kleidorfer, M., and Rauch, W. (2014). “Parallel flow routing in SWMM 5.” Environ. Modell. Software, 53, 27–34.
Chu, E., George, A., Liu, J., and Ng, E. (1984). “SPARSPAK: Waterloo sparse matrix package.” User’s guide for SPARSPAK-A: A collection of modules for solving sparse systems of linear equations, ON, Canada.
Crous, P., van Zyl, J., and Roodt, Y. (2012). “The potential of graphical processing units to solve hydraulic network equations.” J. Hydroinf., 14(3), 603–612.
Crous, P., Van Zyl, J., and Nel, A. (2008). “Using stream processing to improve the speed of hydraulic network solvers.” Proc., 10th Water Distribution Systems Analysis Conf., ASCE, Reston, VA, 1–8.
Dagum, L., and Menon, R. (1998). “OpenMP: An industry standard API for shared-memory programming.” Comput. Sci. Eng., 5(1), 46–55.
Davies, C. (2015) “User Guide for CHOLMOD: A sparse Cholesky factorization and modification package.” NVIDIA, Santa Clara, CA, 〈https://developer.nvidia.com/cholmod〉 (Mar. 27, 2015).
Davis, T. A. (2006). Direct methods for sparse linear systems, Society for Industrial and Applied Mathematics (SIAM), Philadelphia.
Davis, T. A., and Hager, W. W. (2009). “Dynamic supernodes in sparse Cholesky update/downdate and triangular solves.” ACM Trans. Math. Software (TOMS), 35(4), 1–23.
Diao, K., and Rauch, W. (2013). “Controllability analysis as a pre-selection method for sensor placement in water distribution systems.” Water Res., 47(16), 6097–6108.
Diao, K., Wang, Z., Burger, G., Chen, C.-H., Rauch, W., and Zhou, Y. (2013). “Speedup of water distribution simulation by domain decomposition.” Environ. Modell. Software, 52, 253–263.
Farmani, R., Savic, D., and Walters, G. (2004). “EXNET benchmark problem for multi-objective optimization of large water systems.” Modelling and control for participatory planning and managing water systems, International Federation of automatic control (IFAC), Venice, Italy.
Gaidamour, J., Hénon, P., et al. (2008). “HIPS: A parallel hybrid direct/iterative solver based on a Schur complement approach.” Proc., PMAA, Neuchâtel, Switzerland.
George, A., Liu, J., and Ng, E. (1994). Computer solutions of sparse linear systems, Academic Press, Orlando, FL.
Giustolisi, O., Savic, D., and Kapelan, Z. (2008). “Pressure-driven demand and leakage simulation for water distribution networks.” J. Hydraul. Eng., 626–635.
Gould, N. I., Scott, J. A., and Hu, Y. (2007). “A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations.” ACM Trans. Math. Software (TOMS), 33(2), 10.
Guidolin, M., Burovskiy, P., Kapelan, Z., and Savić, D. (2010). “CWSNet: An object-oriented toolkit for water distribution system simulations.” Proc., 12th Water Distribution System Analysis Symp., ASCE, Reston, VA.
Guidolin, M., Kapelan, Z., and Savić, D. (2013). “Using high performance techniques to accelerate demand-driven hydraulic solvers.” J. Hydroinf., 15(1), 38.
Guidolin, M., Savic, D. A., and Kapelan, Z. (2011). “Computational performance analysis and improvement of the demand-driven hydraulic solver for the CWSNet library.” Computing and Control for the Water Industry 2011, Vol. 1, Univ. of Exeter, Devon, U.K., 45–50.
Hogg, J. D., Reid, J. K., and Scott, J. A. (2010). “Design of a multicore sparse Cholesky factorization using DAGS.” SIAM J. Sci. Comput., 32(6), 3627–3649.
Humphrey, J. R., Price, D. K., Spagnoli, K. E., Paolini, A. L., and Kelmelis, E. J. (2010). “Cula: Hybrid GPU accelerated linear algebra routines.” SPIE defense, security, and sensing, International Society for Optics and Photonics, Bellingham, WA.
Intel Corporation. (2013). “Intel Math Kernel library reference manual.” 〈http://software.intel.com/en-us/intel-mkl〉 (Mar. 27, 2015).
Leiserson, C. E., Rivest, R. L., Stein, C., and Cormen, T. H. (2001). Introduction to algorithms, MIT Press, Cambridge, MA.
Li, R., and Saad, Y. (2013). “GPU-accelerated preconditioned iterative linear solvers.” J. Supercomput., 63(2), 443–466.
Lippai, I. (2005). Colorado Springs utilities case study: Water system calibration/optimization, ASCE, Reston, VA.
Mair, M., Rauch, W., and Sitzenfrei, R. (2014). “Improving incomplete water distribution system data.” Procedia Eng., 70, 1055–1062.
Marchi, A., et al. (2014). "The battle of the water networks II (BWN-II)." J. Water Resour. Plann. Manage., 04014009.
Möderl, M., Sitzenfrei, R., Fetz, T., Fleischhacker, E., and Rauch, W. (2011). “Systematic generation of virtual networks for water supply.” Water Resour. Res., 47(2), W02502.
Naumov, M. (2011). "Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU." NVIDIA, Santa Clara, CA.
Ostfeld, A., et al. (2008). "The battle of the water sensor networks (BWSN): A design challenge for engineers and algorithms." J. Water Resour. Plann. Manage., 556–568.
Rennich, S., Stosic, D., and Davis, T. A. (2014). "Accelerating sparse Cholesky factorization on GPUs." IA3 Workshop on Irregular Applications: Architectures and Algorithms (held in conjunction with SC14), Center for Adaptive Supercomputing Software, Richland, WA.
Roshani, E., and Filion, Y. (2015). “Water distribution system rehabilitation under climate change mitigation scenarios in Canada.” J. Water Resour. Plann. Manage., 04014066.
Rossman, L. (2010). “Storm water management model user’s manual version 5.0.” Water Supply and Water Resources Division National Risk Management Research Laboratory.
Rossman, L. A. (2007). "Discussion of 'Solution for water distribution systems under pressure-deficient conditions' by Wah Khim Ang and Paul W. Jowitt." J. Water Resour. Plann. Manage., 566–567.
Rupp, K., Rudolf, F., and Weinbub, J. (2010). “ViennaCL—A high level linear algebra library for GPUs and multi-core CPUs.” Int. Workshop on GPUs and Scientific Applications, Dept. of Scientific Computing, Univ. of Vienna, Austria, 51–56.
Saad, Y. (2003). Iterative methods for sparse linear systems, Society for Industrial and Applied Mathematics (SIAM), Philadelphia.
Schenk, O. (2013). “PARDISO solver project.” 〈http://www.pardiso-project.org/〉 (Mar. 27, 2015).
Schenk, O., and Gärtner, K. (2002). “Two-level dynamic scheduling in PARDISO: Improved scalability on shared memory multiprocessing systems.” Parallel Comput., 28(2), 187–197.
Schenk, O., Gärtner, K., and Fichtner, W. (2000). "Efficient sparse LU factorization with left-right looking strategy on shared memory multiprocessors." BIT Numer. Math., 40(1), 158–176.
Science and Technology Facilities Council. (2013). “The HSL mathematical software library.” 〈http://www.hsl.rl.ac.uk/〉 (Mar. 27, 2015).
Siew, C., and Tanyimboh, T. (2011). “The computational efficiency of Epanet-pdx.” Proc., 13th Annual Water Distribution Systems Analysis Conf., ASCE, Reston, VA, 22–26.
Siew, C., and Tanyimboh, T. (2012). “Pressure-dependent epanet extension.” Water Resour. Manage., 26(6), 1477–1498.
Sitzenfrei, R., and Rauch, W. (2015). "Optimizing small hydropower systems in water distribution systems based on long-time-series simulation and future scenarios." J. Water Resour. Plann. Manage., 04015021.
Sitzenfrei, R., Mair, M., Möderl, M., and Rauch, W. (2011). “Cascade vulnerability for risk analysis of water infrastructure.” Water Sci. Technol., 64(9), 1885–1891.
Sitzenfrei, R., Möderl, M., and Rauch, W. (2013a). “Assessing the impact of transitions from centralised to decentralised water solutions on existing infrastructures-Integrated city-scale analysis with VIBe.” Water Res., 47(20), 7251–7263.
Sitzenfrei, R., Möderl, M., and Rauch, W. (2013b). “Automatic generation of water distribution systems based on GIS data.” Environ. Modell. Software, 47, 138–147.
Snir, M., Otto, S. W., Walker, D. W., Dongarra, J., and Huss-Lederman, S. (1995). MPI: The complete reference, MIT Press, Cambridge, MA.
Tanyimboh, T. T., and Templeman, A. B. (2010). “Seamless pressure-deficient water distribution system model.” Proc. ICE-Water Manage., 163(8), 389–396.
Todini, E., and Pilati, S. (1988). A gradient algorithm for the analysis of pipe networks, Research Studies Press, Taunton, U.K., 1–20.
Van de Geijn, R. A. (1997). Using PLAPACK: Parallel linear algebra package, MIT Press, Cambridge, MA.
Van Zyl, J. E., Savic, D. A., and Walters, G. A. (2004). “Operational optimization of water distribution systems using a hybrid genetic algorithm.” J. Water Resour. Plann. Manage., 160–170.
Volkov, V., and Demmel, J. (2008). "LU, QR and Cholesky factorizations using vector capabilities of GPUs." EECS Dept., Univ. of California, Berkeley, CA.
von zur Gathen, J. (1993). “Parallel linear algebra.” Synthesis of parallel algorithms, Morgan Kaufmann Publishers, San Francisco, 574–617.
Wu, Z. Y., and Lee, I. (2011). “Lessons for parallelizing linear equation solvers and water distribution analysis.” Urban water management: Challenges and opportunities, Centre for Water Systems, Univ. of Exeter, Devon, U.K., 21–26.
Zecchin, A., Thum, P., Simpson, A., and Tischendorf, C. (2012). “Steady-state behavior of large water distribution systems: Algebraic multigrid method for the fast solution of the linear step.” J. Water Resour. Plann. Manage., 639–650.

Information & Authors


Published In

Journal of Water Resources Planning and Management
Volume 142, Issue 3, March 2016

History

Received: Nov 25, 2014
Accepted: Jul 29, 2015
Published online: Nov 20, 2015
Published in print: Mar 1, 2016
Discussion open until: Apr 20, 2016

Authors

Affiliations

Gregor Burger [email protected]
Postdoctoral Researcher, Unit of Environmental Engineering, Univ. of Innsbruck, Technikerstrasse 13, Innsbruck, 6020 Tirol, Austria. E-mail: [email protected]
Robert Sitzenfrei [email protected]
Assistant Professor, Unit of Environmental Engineering, Univ. of Innsbruck, Technikerstrasse 13, Innsbruck, 6020 Tirol, Austria (corresponding author). E-mail: [email protected]
Manfred Kleidorfer [email protected]
Assistant Professor, Unit of Environmental Engineering, Univ. of Innsbruck, Technikerstrasse 13, Innsbruck, 6020 Tirol, Austria. E-mail: [email protected]
Wolfgang Rauch [email protected]
Full Professor, Unit of Environmental Engineering, Univ. of Innsbruck, Technikerstrasse 13, Innsbruck, 6020 Tirol, Austria. E-mail: [email protected]
