Parallelisation of a Finite Difference Code for Modelling Ground Water Flow

S.P.G. Madabhushi*, D. Bellotti+, A.Clematis+ & P. Fernandes+

*Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ England

+Istituto per la Matematica Applicata, Consiglio Nazionale delle Ricerche, Via De Marini 6, 16149 Genova, Italy

Abstract

Ground water flow needs to be modelled in a wide variety of civil engineering problems. However, solving of specific boundary value problems is often restricted by the complex geometries one faces in the field especially when the problem has three dimensional features that need to be modelled. In this paper we attempt to solve the problem of ground water flow in porous media using a finite difference code called 'MODFLOW', McDonald and Harbaugh (1988). While this code performs satisfactorily in solving simple two or three dimensional problems, it takes a large computational effort when used to solve flow through layered soil strata with varying hydraulic conductivities. The modular structure of the code renders itself for the development of a parallelised code. Parallelisation of the code was attempted at the Edinburgh Parallel Computing Centre using the Cray T3D machine. In this paper we discuss the parallelisation procedure and compare the results of the sequential code and the parallelised code for a specific problem. Based on the results of a theoretical analysis and early numerical experiments we are confident that the use of parallel codes significantly enhances our capability to analyse complex ground water flow problems for various applications.

1 Introduction

Flow through porous media was studied extensively from late 18th Century. A wide range of problems in Civil Engineering require the understanding of ground water flow through porous media. These problems include many areas of active research such as determination of yield of wells situated at different locations above acquifers or determination of flow patterns under a dam in hydraulics, the study of contaminant transport in ground water in Environmental Geotechnics to problems of outflow from oil bearing strata in Petroleum Engineering. The transport processes of flow through porous media are now well understood and the equations governing the flow through porous media are well established. However, when the solutions for specific boundary value problems are solicited using the mathematical models developed so far, one finds that it is difficult to solve three dimensional problems which truly represent the field conditions. With the developments in the digital computers one is able to solve such complex 3-D problems.

The 'MODFLOW' finite difference code was developed by McDonald and Harbaugh (1988), with the aim of solving three dimensional problems of ground water flow through porous media. This code was developed based on the long term work of Trescott (1975), Trescott & Larson (1976), Trescott, Pinder and Larson (1976) and was used extensively by US Geological Survey and others. The MODFLOW code is a block-centred finite difference code which was developed with a modular structure. While the program works satisfactorily for simple two and three dimensional problems of ground water flow, the computational effort is rather large for complex three dimensional problems with layered soils of varying hydraulic conductivities.

Exploiting the modular structure of MODFLOW program, a parallel version of code was developed at the Edinburgh Parallel Computing Centre. The main goal was to reduce the analysis time while solving complex ground water flow problems. It would be helpful to parallelise the code especially for use in applications with large data domains which take significant amount of time as seen in later sections. The basic requirements for the parallelisation procedure are identified as;

  1. The parallelisation should allow large re-use of sequential code;
  2. The parallelisation should be simple, efficient and based on standard and machine independent tools to provide effective portability;
  3. The underlying model should be preserved in order to make the parallel implementation as much consistent as possible with the sequential one.

In this paper we discuss the parallelisation procedure and compare the analysis times of the original sequential MODFLOW code and the parallel version. The results of the numerical experiments are also presented in this paper.

2 Mathematical Modelling

The governing equation for ground water flow through porous medium may be written using the partial differential equation;

(1)

where and are the hydraulic conductivities in the three orthogonal directions, h is the head driving the flow, W is the volumetric flux per unit volume and represents the source/sink term for water, is the specific storage capacity of the porous medium and t is the time. In general , and may be functions of space and W may be a function of space and time. The above equation together with the specification of flow and/or head conditions at the boundaries and the specification of an initial head condition constitutes a mathematical model of flow through anisotropic, heterogeneous porous medium. The solution is based on the finite differences using the Block Successive Overrelaxation method. The block is a "slice" of the problem domain, orthogonal to the y axis and one cell wide. In this iterative method, head values at nodes in the same slice are simultaneously updated by solving the (sub) system of the corresponding equations, treating the terms for adjacent slices as known quantities. In an iteration, all the slices are sequentially processed in order of decreasing y; then, a new iteration is started, unless a termination test is satisfied. This algorithm for the numerical solution of the ground water flow discrete model is same as the one implemented in the original sequential version of the MODFLOW code.

3 Parallelising the Algorithm

Time stepping is inherently a sequential procedure and no parallelisation can be done at this level. Parallelisation was carried out for the processing of the linear system which is solved to step from a time level to the next one. This was done by exploiting the already existing subdivision of the problem domain in slices together with a general feature of relaxation methods. In solving the linear system, indeed, data dependencies exist that prevent parallelisation of sequential algorithm without modification. In fact, the new head values obtained in processing any slice are used in processing the next. Nevertheless, as relaxation methods are very flexible, most of these data dependencies can be easily removed just by changing the order in which the slices are processed (Fox, 1988).

A relaxation method is defined by giving a way of updating the values in a subdomain (possibly reduced to a point) using the most recently computed values in adjacent subdomains and a order in which the subdomains are to be processed. However, even though the convergence rate may be affected by the processing order, no specific order is mandatory for convergence. Hence, as data dependencies depend on the processing order, part of them can be removed by a suitable reordering, at the cost of a possible reduction of the convergence rate.

In order to obtain a sequential algorithm more suitable for parallelisation, it is convenient to subdivide the problem domain in m blocks containing the same number of contiguous slices and to renumber the slices in the following order: slice 1 of block 1, slice 1 of block 2, ..., slice 1 of block m, slice 2 of block 1, slice 2 of block 2, ... , slice 2 of block m, and so on. With this processing order, when the first slice of any block is processed, the most recently computed head values in the last slice of the previous block are those obtained in the previous iteration. Hence, there is no data dependency between contiguous blocks and they could be processed in parallel.

In order to parallelise the modified sequential algorithm, we use an array of workers, and a master processor. Each worker takes directly its input data, co-operates to execute computation for a time step, and sends its results to the master process. The master initialises some general information and the array, then wait for results at the end of each time step to check global consistency of the solution by computing the global water budget, and prepare output. Figure 1 summarises the behaviour of master and worker processes within a generic time step. Each worker takes a subdomain of k contiguous slices and process it. In order to process its own subdomain, the worker needs information on the border slices of contiguous subdomains. Contiguous workers process contiguous subdomains and communicate each other to exchange this information. Data flow and domain partition is shown in Figure 2(a) and Figure 2(b), while Figure 3 summarises control flow and process interactions.

4 Evaluating the Parallel Algorithm

In this section we provide an analysis of the expected speed-up of the parallel algorithm and early experimental results. It is important to evaluate the actual speed-up, that is the ratio between the execution time of the original sequential algorithm and the execution time of the parallel algorithm. For this reason we provide an analytical formulation of speed-up in both cases. We obtain an almost linear expected speed-up with respect to the modified sequential algorithm, while further investigations are necessary to asses the speed-up with respect to the original sequential algorithm, however early experimental results show an improved speed-up.

4.1 Execution time for the sequential algorithms

We are interested in evaluating the execution time of the computational kernel of the algorithm. For a given domain D, and for each stress period this time is the sum of the time required to compute the single time-step, hence for time steps:

(2)

In Eq.2 represents the time to compute the volumetric budget, this operation is repeated at each time step. The time to compute the single step depends on the time necessary to execute the inner loop and on the number of iterations, which may be different for the different time steps, hence: .

The time to execute the inner loop () is the same for the different time steps, while the number of iterations may change. If we define: then it is possible to write (2) as:

(3)

This formulation is general and holds both for the original sequential algorithm and for the modified version, it is the number of iterations that is different in the two cases. If we denote with the execution time, with the number of iterations of the modified sequential algorithm, and assume that is negligible with respect to the solution time of the linear system, then the following holds:

(4)

4.2 Execution time for the parallel algorithm

The execution time of the parallel algorithm is evaluated considering the division of work between the master and worker processes. The computational kernel is executed by workers. Figure 1 summarises the behaviour of master and worker processes within a generic time step. For each time step the master process updates the global budget using results of the previous step, and then wait for new results from workers. Workers solve the system for that step each one on a different subdomain, and provide back results to the master. System solution has a higher cost with respect to budget update, hence the master is normally waiting for workers communication and its cost is masked by workers. Hence it is possible to provide a first formulation of the execution time for the parallel algorithm for m time steps:

(5)

Here is the cost of communication between workers and master at each time step, is the global cost of a worker over the m time steps on a subdomain , and is the cost of budget update after last time step.

In each iteration a worker communicate with each one of its two direct neighbours, and participate to the global reduction to test convergence. If we indicate with the total cost of communications and participation to global reduction, it is possible to further refine the cost of the worker:

It is important to notice that the number of iterations is the same of the modified sequential algorithm. We may then rewrite the execution time for the parallel version as:

(6)

4.3 Speed-up Analysis

As discussed above we may consider the speed-up provided by the parallel version with respect to the original and modified sequential algorithm. Using P processors, 1 for the master and P-1 for the workers, the speed-up with respect to the original sequential algorithm is:

(7)

This formula cannot be further reduced since it is not possible to establish, at this stage, the exact relationship between the number of iteration for the original and modified algorithm. Looking at the speed-up formulation with respect to the modified version we may go further, in fact Eq.7 becomes:


From this formula we may argue that a linear speed-up should be obtained. In fact if the domain partition is not too fine grained then , and , considering that is negligible with respect to the other part of the computation, then we have that:

(8)

4.4 Experimental results on Cray-T3D

The program has been implemented on Cray-T3D of EPCC and early experiment's show good results. The same problem has been considered for different domain sizes, and the measured times show that the algorithm has a good scalability. In order to provide a complete comparison it is useful to consider execution times for a sequential implementation in a more common environment such as a sparc-20 workstation. Figure 4 reports execution times for three different domain sizes in the workstation environment. It shows that the cost of the algorithm grows very fast, since doubling the domain size leads to an increase of a factor around 3.6 in execution time.

The benefit of parallelisation is clear looking at Figure 5, that reports execution time on T3D using 2, 4, 8 and 16 processors and for the above domain sizes. It is important to point out that with the minimal T3D configuration (2 processors) we still have an almost sequential algorithm (the modified one), since the master does not contribute to the linear system solution, and only one worker computes it. Hence passing from 2 to 4 processors is equivalent to pass from almost sequential execution to a parallel one with 3 processors. This fact justifies the otherwise superlinear speed-up obtained.

It is interesting to notice that using 4 processor (i.e. a parallelism of level 3) we obtain a comparable execution time for the different domain sizes. For the largest domain scaling up to 16 processors still provides a good efficiency (around 70%). These early experiences seem to confirm the expected linear speed-up.

Planned future work is aimed to broaden experimental evaluation and looking at using parallel code in a production environment. At this aim performances of parallel code using local area workstation network as a parallel virtual machine will be evaluated.

Acknowledgements

The first and second authors acknowledge the support of the TRACS programme at EPCC (Edinburgh Parallel Computing Centre), funded by the European Commission's Training and Mobility of Researchers programme. All the authors acknowledge EPCC for the opportunity of carrying out experiments on Cray-T3D.

References

[1] G. Fox et al, (1988), Solving Problems on Concurrent Processors, Prentice Hall Inc.

[2] M.G. McDonald and A.W. Harbaugh, (1988), A modular three dimensional finite-difference ground water flow model, volume 6 of "Techniques of Water-Resources Investigation of the United States Geological Survey", chapter A1, Scientific Software Group.

[3] P.C. Trescott and S.P. Larson, (1976), Documentation of finite-difference model for simulation of three-dimensional ground water flow, Supplement to Open-File Report 75-438, U.S. Geological Survey Open-File Report 76-591, pp 21.

[4] P.C. Trescott, G.F. Pinder, S.P. Larson, (1976), Finite difference model for aquifer simulation in two dimensions with results of numerical experiments, in Techniques of Water-Resources Investigation of the United States Geological Survey, Book 7, pp 116.

[5] P.C. Trescott, (1975), Documentation of finite-difference model for simulation of three-dimensional ground water flow", U.S. Geological Survey Open-File Report 75-438, pp 32.