Parallelisation of a Finite Difference Code for Modelling Ground Water Flow
S.P.G. Madabhushi*, D. Bellotti+, A.Clematis+ & P. Fernandes+
*Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ England
+Istituto per la Matematica Applicata,
Consiglio Nazionale delle Ricerche, Via De Marini 6, 16149 Genova, Italy
Abstract
Ground water flow needs to be modelled in a wide variety of civil engineering problems. However, solving of specific boundary value problems is often restricted by the complex geometries one faces in the field especially when the problem has three dimensional features that need to be modelled. In this paper we attempt to solve the problem of ground water flow in porous media using a finite difference code called 'MODFLOW', McDonald and Harbaugh (1988). While this code performs satisfactorily in solving simple two or three dimensional problems, it takes a large computational effort when used to solve flow through layered soil strata with varying hydraulic conductivities. The modular structure of the code renders itself for the development of a parallelised code. Parallelisation of the code was attempted at the Edinburgh Parallel Computing Centre using the Cray T3D machine. In this paper we discuss the parallelisation procedure and compare the results of the sequential code and the parallelised code for a specific problem. Based on the results of a theoretical analysis and early numerical experiments we are confident that the use of parallel codes significantly enhances our capability to analyse complex ground water flow problems for various applications.
1 Introduction
Flow through porous media was studied extensively from late 18th Century. A wide range of problems in Civil Engineering require the understanding of ground water flow through porous media. These problems include many areas of active research such as determination of yield of wells situated at different locations above acquifers or determination of flow patterns under a dam in hydraulics, the study of contaminant transport in ground water in Environmental Geotechnics to problems of outflow from oil bearing strata in Petroleum Engineering. The transport processes of flow through porous media are now well understood and the equations governing the flow through porous media are well established. However, when the solutions for specific boundary value problems are solicited using the mathematical models developed so far, one finds that it is difficult to solve three dimensional problems which truly represent the field conditions. With the developments in the digital computers one is able to solve such complex 3-D problems.
The 'MODFLOW' finite difference code was developed by McDonald and Harbaugh (1988), with the aim of solving three dimensional problems of ground water flow through porous media. This code was developed based on the long term work of Trescott (1975), Trescott & Larson (1976), Trescott, Pinder and Larson (1976) and was used extensively by US Geological Survey and others. The MODFLOW code is a block-centred finite difference code which was developed with a modular structure. While the program works satisfactorily for simple two and three dimensional problems of ground water flow, the computational effort is rather large for complex three dimensional problems with layered soils of varying hydraulic conductivities.
Exploiting the modular structure of MODFLOW program, a parallel version of code was developed at the Edinburgh Parallel Computing Centre. The main goal was to reduce the analysis time while solving complex ground water flow problems. It would be helpful to parallelise the code especially for use in applications with large data domains which take significant amount of time as seen in later sections. The basic requirements for the parallelisation procedure are identified as;
In this paper we discuss the parallelisation procedure and
compare the analysis times of the original sequential MODFLOW code and the parallel
version. The results of the numerical experiments are also presented in this paper.
2 Mathematical Modelling
The governing equation for ground water flow through porous medium may be written using the partial differential equation;
(1)
where and
are the
hydraulic conductivities in the three orthogonal directions, h is the head driving
the flow, W is the volumetric flux per unit volume and represents the source/sink
term for water,
is the specific storage capacity of the porous medium and t is the
time. In general
,
and
may be
functions of space and W may be a function of space and time. The above equation
together with the specification of flow and/or head conditions at the boundaries and the
specification of an initial head condition constitutes a mathematical model of flow
through anisotropic, heterogeneous porous medium. The solution is based on the finite
differences using the Block Successive Overrelaxation method. The block is a
"slice" of the problem domain, orthogonal to the y axis and one cell wide. In
this iterative method, head values at nodes in the same slice are simultaneously updated
by solving the (sub) system of the corresponding equations, treating the terms for
adjacent slices as known quantities. In an iteration, all the slices are sequentially
processed in order of decreasing y; then, a new iteration is started, unless a termination
test is satisfied. This algorithm for the numerical solution of the ground water flow
discrete model is same as the one implemented in the original sequential version of the
MODFLOW code.
3 Parallelising the Algorithm
Time stepping is inherently a sequential procedure and no parallelisation can be done at this level. Parallelisation was carried out for the processing of the linear system which is solved to step from a time level to the next one. This was done by exploiting the already existing subdivision of the problem domain in slices together with a general feature of relaxation methods. In solving the linear system, indeed, data dependencies exist that prevent parallelisation of sequential algorithm without modification. In fact, the new head values obtained in processing any slice are used in processing the next. Nevertheless, as relaxation methods are very flexible, most of these data dependencies can be easily removed just by changing the order in which the slices are processed (Fox, 1988).
A relaxation method is defined by giving a way of updating the values in a subdomain (possibly reduced to a point) using the most recently computed values in adjacent subdomains and a order in which the subdomains are to be processed. However, even though the convergence rate may be affected by the processing order, no specific order is mandatory for convergence. Hence, as data dependencies depend on the processing order, part of them can be removed by a suitable reordering, at the cost of a possible reduction of the convergence rate.
In order to obtain a sequential algorithm more suitable for parallelisation, it is convenient to subdivide the problem domain in m blocks containing the same number of contiguous slices and to renumber the slices in the following order: slice 1 of block 1, slice 1 of block 2, ..., slice 1 of block m, slice 2 of block 1, slice 2 of block 2, ... , slice 2 of block m, and so on. With this processing order, when the first slice of any block is processed, the most recently computed head values in the last slice of the previous block are those obtained in the previous iteration. Hence, there is no data dependency between contiguous blocks and they could be processed in parallel.
In order to parallelise the modified sequential algorithm, we use
an array of workers, and a master processor. Each worker takes directly its input data,
co-operates to execute computation for a time step, and sends its results to the master
process. The master initialises some general information and the array, then wait for
results at the end of each time step to check global consistency of the solution by
computing the global water budget, and prepare output. Figure 1 summarises the behaviour
of master and worker processes within a generic time step. Each worker takes a subdomain
of k contiguous slices and process it. In order to process its own subdomain, the
worker needs information on the border slices of contiguous subdomains. Contiguous workers
process contiguous subdomains and communicate each other to exchange this information.
Data flow and domain partition is shown in Figure 2(a) and Figure 2(b), while Figure 3
summarises control flow and process interactions.
4 Evaluating the Parallel Algorithm
In this section we provide an analysis of the expected speed-up of the parallel
algorithm and early experimental results. It is important to evaluate the actual speed-up,
that is the ratio between the execution time of the original sequential algorithm and the
execution time of the parallel algorithm. For this reason we provide an analytical
formulation of speed-up in both cases. We obtain an almost linear expected speed-up with
respect to the modified sequential algorithm, while further investigations are necessary
to asses the speed-up with respect to the original sequential algorithm, however early
experimental results show an improved speed-up.
4.1 Execution time for the sequential algorithms
We are interested in evaluating the execution time of the computational kernel of the
algorithm. For a given domain D, and for each stress period this time is the sum of the
time required to compute the single time-step, hence for time
steps:
(2)
In Eq.2 represents the time to compute the volumetric budget, this operation is
repeated at each time step. The time to compute the single step depends on the time
necessary to execute the inner loop and on the number of iterations, which may be
different for the different time steps, hence:
.
The time to execute the inner loop () is the
same for the different time steps, while the number of iterations may change. If we
define:
then it is possible to write (2) as:
(3)
This formulation is general and holds both for the original sequential algorithm and
for the modified version, it is the number of iterations that is different in the two
cases. If we denote with the execution time, with
the
number of iterations of the modified sequential algorithm, and assume that
is
negligible with respect to the solution time of the linear system, then the following
holds:
(4)
4.2 Execution time for the parallel algorithm
The execution time of the parallel algorithm is evaluated considering the division of
work between the master and worker processes. The computational kernel is executed by
workers. Figure 1 summarises the behaviour of master and worker processes within a generic
time step. For each time step the master process updates the global budget using results
of the previous step, and then wait for new results from workers. Workers solve the system
for that step each one on a different subdomain, and provide back results to the master.
System solution has a higher cost with respect to budget update, hence the master is
normally waiting for workers communication and its cost is masked by workers. Hence it is
possible to provide a first formulation of the execution time for the parallel algorithm
for m time steps:
(5)
Here is the cost of communication between workers and master at each time step,
is the
global cost of a worker over the m time steps on a subdomain
, and
is the
cost of budget update after last time step.
In each iteration a worker communicate with each one of its two direct neighbours, and
participate to the global reduction to test convergence. If we indicate with the
total cost of communications and participation to global reduction, it is possible to
further refine the cost of the worker:
It is important to notice that the number of iterations is the same of the modified
sequential algorithm. We may then rewrite the execution time for the parallel version as:
(6)
4.3 Speed-up Analysis
As discussed above we may consider the speed-up provided by the parallel version with
respect to the original and modified sequential algorithm. Using P processors, 1
for the master and P-1 for the workers, the speed-up with respect to the
original sequential algorithm is:
(7)
This formula cannot be further reduced since it is not possible to establish, at this
stage, the exact relationship between the number of iteration for the original and
modified algorithm. Looking at the speed-up formulation with respect to the modified
version we may go further, in fact Eq.7 becomes:
From this formula we may argue that a linear speed-up should be obtained. In fact if
the domain partition is not too fine grained then , and
,
considering that
is negligible with respect to the other part of the computation,
then we have that:
(8)
4.4 Experimental results on Cray-T3D
The program has been implemented on Cray-T3D of EPCC and early experiment's show good results. The same problem has been considered for different domain sizes, and the measured times show that the algorithm has a good scalability. In order to provide a complete comparison it is useful to consider execution times for a sequential implementation in a more common environment such as a sparc-20 workstation. Figure 4 reports execution times for three different domain sizes in the workstation environment. It shows that the cost of the algorithm grows very fast, since doubling the domain size leads to an increase of a factor around 3.6 in execution time.
The benefit of parallelisation is clear looking at Figure 5, that reports execution time on T3D using 2, 4, 8 and 16 processors and for the above domain sizes. It is important to point out that with the minimal T3D configuration (2 processors) we still have an almost sequential algorithm (the modified one), since the master does not contribute to the linear system solution, and only one worker computes it. Hence passing from 2 to 4 processors is equivalent to pass from almost sequential execution to a parallel one with 3 processors. This fact justifies the otherwise superlinear speed-up obtained.
It is interesting to notice that using 4 processor (i.e. a parallelism of level 3) we obtain a comparable execution time for the different domain sizes. For the largest domain scaling up to 16 processors still provides a good efficiency (around 70%). These early experiences seem to confirm the expected linear speed-up.
Planned future work is aimed to broaden experimental evaluation and looking at using
parallel code in a production environment. At this aim performances of parallel code using
local area workstation network as a parallel virtual machine will be evaluated.
Acknowledgements
The first and second authors acknowledge the support of the TRACS programme at EPCC
(Edinburgh Parallel Computing Centre), funded by the European Commission's Training and
Mobility of Researchers programme. All the authors acknowledge EPCC for the opportunity of
carrying out experiments on Cray-T3D.
References
[1] G. Fox et al, (1988), Solving Problems on Concurrent Processors, Prentice Hall Inc.
[2] M.G. McDonald and A.W. Harbaugh, (1988), A modular three dimensional finite-difference ground water flow model, volume 6 of "Techniques of Water-Resources Investigation of the United States Geological Survey", chapter A1, Scientific Software Group.
[3] P.C. Trescott and S.P. Larson, (1976), Documentation of finite-difference model for simulation of three-dimensional ground water flow, Supplement to Open-File Report 75-438, U.S. Geological Survey Open-File Report 76-591, pp 21.
[4] P.C. Trescott, G.F. Pinder, S.P. Larson, (1976), Finite difference model for aquifer simulation in two dimensions with results of numerical experiments, in Techniques of Water-Resources Investigation of the United States Geological Survey, Book 7, pp 116.
[5] P.C. Trescott, (1975), Documentation of finite-difference model for simulation of three-dimensional ground water flow", U.S. Geological Survey Open-File Report 75-438, pp 32.