On solving separable block tridiagonal linear systems using a GPU implementation of radix-4 PSCR method (cid:73)

Partial solution variant of the cyclic reduction (PSCR) method is a direct solver that can be applied to certain types of separable block tridiagonal linear systems. Such linear systems arise, e.g., from the Poisson and the Helmholtz equations discretized with bilinear ﬁnite-elements. Furthermore, the separability of the linear system entails that the discretization domain has to be rectangular and the discretization mesh orthogonal. A generalized graphics processing unit (GPU) implementation of the PSCR method is presented. The numerical results indicate up to 24-fold speedups when compared to an equivalent CPU implementation that utilizes a single CPU core. Attained ﬂoating point performance is analyzed using rooﬂine performance analysis model and the resulting models show that the attained ﬂoating point performance is mainly limited by the oﬀ-chip memory bandwidth and the eﬀectiveness of a tridiagonal solver used to solve arising tridiagonal subproblems. The performance is accelerated using oﬀ-line autotuning techniques.

On solving separable block tridiagonal linear systems using a GPU implementation of radix-4 PSCR method $

Introduction
Separable block tridiagonal linear systems appear in many practical applications.In such a system, the coefficient matrix is presented in a separable form using Kronecker matrix tensor products.A good example is a Poisson equation with Dirichlet boundary conditions discretized in a rectangular domain using an orthogonal finite-element mesh and piecewise linear finite-elements.A similarly treated Helmholtz equation either with absorbing boundary conditions (ABC) [1,2,3] or a perfectly matched layer (PML) [4,5], among others, also leads to a suitable linear system when discretized using bilinear (or trilinear) finiteelements [6].Suitable linear systems can also be obtained using various finite difference approximations.Separable block tridiagonal systems appear as subproblems in a variety of situations.For example, a similarly discretized diffusion equation with a suitable implicit time-stepping scheme (implicit Euler being the simplest example) leads to the solution of a separable block tridiagonal linear system on each time step and suitable systems appear in image processing applications (see, e.g., [7]).In fact, the implementation described in this paper has already been used in image denoising [8].
Several effective numerical methods have been derived by employing many useful properties of the Kronecker matrix tensor forms.A comprehensive survey of these so-called matrix decomposition algorithms (MDAs) can be found in [9].
A MDA operates by reducing the linear system into a set of smaller sub-systems which are solved independently of each other.The solution of the original linear system is then obtained by reverting the reduction operation.MDAs are similar to the well-known method of separation of variables and many MDAs utilize the fast Fourier transformation method while performing the reduction operation.
Cyclic reduction methods are a well-known class of numerical algorithms that can be applied, among many other things, to tridiagonal and block tridiagonal linear systems.A traditional cyclic reduction type method (see, e.g., [10,11,12,13,14,15]) operates in two stages.The first stage generates a sequence of sub-systems by recursively eliminating odd numbered (block) rows from the system.As a result, the size of each sub-system is approximately half of the size of the preceding sub-system.This means that the reduction factor, or the radix number as it is often called, is two.The sub-systems are solved in reverse order during the back substitution stage of the algorithm.The reduction operation often takes advantage of the properties of the coefficient matrix.For example, in certain cases, the sub-systems can be represented using matrix rational polynomials which considerably simplifies the formulas and reduces the amount of memory needed perform the computations [11,13,16].A survey of the cyclic reduction methods and their applications can be found in [17].
While scalar cyclic reduction (CR) method [10] and its parallel variant (parallel cyclic reduction, PCR) [18] have become very popular methods for the solution of tridiagonal linear systems on graphics processing units (GPUs) (see, e.g., [19,20,21,22,23,24,25]), the block cyclic reduction type methods have been a somewhat less popular topic.Comparisons between CPU and GPU implementations of simplified radix-2 [13] and radix-4 [16] methods can be found in [26].The paper concluded that the block cyclic reduction type methods considered in the paper are suitable for GPU computation and that the radix-4 method is better able to utilize GPU's parallel computing resources.The presented numerical results indicate up to 6-fold speed increase for two-dimensional Poisson equations and up to 3-fold speed increase for three-dimensional Poisson equations.The implementations were compared against equivalent multi-threaded CPU implementations run on a quad-core CPU.This paper deals with a direct method called partial solution variant of the cyclic reduction (PSCR).It is a cyclic reduction type method with MDA-type features and it can be applied to separable block tridiagonal linear systems.The initial work on the (radix-2) PSCR method was done in the 80's by Vassilevski [27,28] and Kuznetsov [29].The method uses so-called partial solution technique [30,31], which can be applied effectively to a separable linear system when only a sparse set of the solution components is required and the right-hand side vector has only a few non-zero elements.The technique is a special form of MDA.A more generalized radix-q algorithm was formulated later in [32] and a parallel radix-4 CPU implementation was presented in [33].Parallel implementations were also considered earlier in [34] and [35].Here the radix number q means that the system size is reduced by a factor of q on each reduction step.The usual formulation of the PSCR method only distantly resembles the formulation of traditional block cyclic reduction methods.However, in certain special cases, equivalent methods can be derived using a more traditional matrix rational polynomial approach [13,16].If the factor matrices are tridiagonal, then the arithmetical complexity of the PSCR method is O(n 1 n 2 log n 1 ) for n 1 n 2 × n 1 n 2 problems and O(n This paper presents a generalized GPU implementation of the radix-4 PSCR method and in a sense this paper can be seen as a natural follow-up to [16].The implementation can be applied to real and complex valued problems.The underlying partial differential equation (PDE) that induces the linear system can be two or three dimensional (i.e. each term in the tensor product form involves two or three factor matrices) and the factor matrices are assumed to be tridiagonal and symmetric.Certain factor matrices are assumed to be positive definite.
The implementation is tested using a Poisson equation with Dirichlet boundary conditions and a Helmholtz equation with second-order ABC.In the both cases, the domain is rectangular and the discretization is performed using an orthogonal finite-element mesh.In the case of the Poisson equation, the finiteelement space consists of piecewise linear elements, and in the case of the Helmholtz equation, the finite-element space consist of bilinear elements.The numerical results are analyzed using roofline performance analysis model [36].
The model takes into account the available off-chip memory bandwidth and thus provides a very extensive picture of how effectively computational and memory resources are being utilized.
The rest of this paper is organized as follows: Section 2 introduces the reader to the CR method.Section 3 describes the Kronecker matrix tensor product, the partial solution technique, and the PSCR method.Section 4 gives a brief introduction to GPU computing and describes the GPU implementation.Section 5 presents the numerical results and the roofline models.The final conclusions are given in Section 6.

95
The PSCR method resembles the CR method in many respects.Thus, we begin by describing the CR method in a simple setting in an attempt to introduce the reader to the basic idea of cyclic reduction.Consider a tridiagonal linear system where for some positive integer k.From now on, we assume that the field K is either R or C. We focus on the equation 2j: We can now eliminate the unknowns u 2j−1 and u 2j+1 from the equation 2j.
When we apply this procedure to all even-numbered equations in the system, we end up eliminating all odd-numbered unknowns.Furthermore, the new linear system is tridiagonal which means that we can apply this procedure recursively.More formally, we generate a sequence of tridiagonal sub-systems as follows: Let a where j = 1, 2, . . ., 2 k−i − 1.Then, for i = k − 1, k − 2, . . ., 0, we solve each sub-system using the formula where j = 1, 2, . . ., 100 Finally, u = u (0) .The CR method can be easily generalized for linear systems of arbitrary size in which case the arithmetical complexity of the method is O(m), where m is the number of equation in the system.

Kronecker matrix tensor product forms 105
The Kronecker matrix tensor product is defined for matrices B ∈ K n1×n1 and C ∈ K n2×n2 as follows The product has two properties which are the basis of many MDAs: First, let D ∈ K n1×n1 and E ∈ K n2×n2 .Then, Second, let D ∈ K n1×n1 and E ∈ K n2×n2 .If the matrices D and E are nonsingular, then product matrix D ⊗ C ∈ K n1n2×n1n2 is also nonsingular and These two results can be derived from the definition of the Kronecker matrix tensor product.Generally speaking, the PSCR method considered in this paper can be applied to a linear system Au = f , with where the factor matrices A 1 ∈ K n1×n1 and M 1 ∈ K n1×n1 are symmetric and tridiagonal, A 2 , M 2 ∈ K n2×n2 , and c ∈ K. Thus, the coefficient matrix A is symmetric and block tridiagonal.In addition, the factor matrix M 1 is positive definite.Linear systems of this form usually result from the discretization of two-dimension PDEs.The method can also be used to solve three-dimensional PDEs in which case the coefficient matrix A would be of the form where the factor matrices A 2 and M 2 are symmetric and tridiagonal, and A 3 , M 3 ∈ K n3×n3 .In addition, the factor matrices M 1 and M 2 are positive definite.From now on, we refer to linear systems with the coefficient matrices of the form (7) as two-dimensional problems and linear systems with the coefficient matrices of the form (8) as three-dimensional problems.
The PSCR method reduces a three-dimensional problem into a set of twodimensional problems where the coefficient matrices are of the form with λ ∈ K.The PSCR can be applied recursively to solve problems with (9).

Overview of the algorithm
This and the next two subsections describe the radix-q PSCR method using projection matrices similarly to [33].We now loosely define sets J 0 , J 1 , . . ., J k ⊂ N that will indirectly determine which block rows are eliminated during each reduction step.The exact formulation of the PSCR method depends on how these sets are chosen.In the case of the radix-q PSCR method, k = log q n 1 +1 and the sets fulfill the following conditions: . . . ) be the elements of the set Ĵi in ascending order, and Then #D (i) l ≤ q − 1 for all i = 1, 2, . . ., k and l = 1, 2, . . ., #J i + 1. Above #J i and #D (i) l are the cardinalities of the sets J i and D (i) l , respectively.The consequence of the third condition is that the rows, that are to be eliminated during a reduction step, are distributed into groups of a size of no more than q − 1. Examples of the sets J 1 , J 2 , . . ., J k are given in Figure 1 and in the following Section 3.4.An example of the sets J 0 , J 1 , J 2 , J 3 when q = 4 and n 1 = 32.Indexes that belong to each set are highlighted in black.The set J i implies that block rows, that correspond to indexes that belong to the set J i−1 \ J i , are eliminated from the system during the ith reduction step.
With the sets J 0 , J 1 , . . ., J k , we can define projection matrices with Based on these, we can define a second set of projection matrices: Under the assumption that a projected matrix P (i) AP (i) is nonsingular in subspace Im(P (i) ) for all i = 1, 2, . . ., k, the linear system Au = f can be solved in two stages: 1. Let f (1) = f .Then, for i = 1, 2, . . ., k − 1: Solve the vector v (i) from and compute 2. Let u (k+1) = 0.Then, for i = k, k − 1, . . ., 1: Solve the vector u (i) from and compute Finally, u = u (1) .
The rationale behind this recursive technique becomes clear after the following observations: Due to the special block tridiagonal structure of the coefficient matrix A, only a sparse set of the solution components are actually required by the update formulas ( 14) and ( 16), and the right-hand side vectors in the projected systems ( 13) and ( 15) have only a few non-zero elements.In this situation the partial solution technique [30,31] can be applied very effectively.In addition, the projected systems ( 13) and ( 15) decouple into several independent sub-systems.

Partial solution technique
We will now investigate how to solve independent sub-systems in ( 13) and ( 15).Let us focus on a separable linear system Ãv = g with where Ã1 ∈ K m×m and M1 ∈ K m×m are matching diagonal blocks from the matrices A 1 and M 1 , respectively.Let us have projection matrices R ∈ K m×m and Q ∈ K m×m defining the required solution blocks and the non-zero blocks of the right-hand side vector g, respectively.Based on these projection matrices, we construct two additional projection matrices: R = R ⊗ I n2 and Thus, g ∈ Im(Q) and instead of solving the whole vector v, we are going to solve only vector Rv.
The vector Rv can be solved effectively by using the formula: where the diagonal matrix Λ ∈ K m×m and the matrix W ∈ K m×m have the properties In other words, contain the eigenvalues λ 1 , λ 2 , . . ., λ m ∈ K and the M1 -orthonormal eigenvectors w 1 , w 2 , . . ., w m ∈ K m , which satisfy the generalized eigenvalue problem The formula (18) follows from the properties ( 5) and ( 6).
In the case of the radix-q PSCR method, the projection matrices R and Q contain no more than q + 1 non-zero blocks, which means that only q + 1 components from each eigenvector are required to compute Rv.If the factor matrices A 2 and M 2 are tridiagonal, then the arithmetical complexity of computing a partial solution is O(mn 2 ), the most expensive operation being the solution of m n 2 × n 2 tridiagonal linear systems.

Explicit formulas
This subsection gives explicit formulas in the special case when n 1 = q k − 1 for some positive integer k.The method can be generalized for arbitrarily n 1 but due to lengthy formulas we limit ourselves to presenting the method in this special case.However, we emphasize that the implementation presented in this paper is applicable for general n 1 .We begin by defining the sets J 0 , J 1 , . . ., J k ⊂ N as Clearly these sets fulfill the conditions enumerated in Section 3.2.
In summary, the radix-q PSCR method proceeds for a two-dimensional problem as follows: 1. Solve a set of generalized eigenvalue problems where q k−i ∈ K mi×mi are the non-zero diagonal blocks from the projected matrices P (i) A 1 P (i) and P (i) M 1 P (i) , respectively, in order starting from the upper left corner, and m i = q i − 1.This step can be considered as an initialization 165 stage.The numerical stability of the PSCR method depends largely on the properties of these generalized eigenvalue problems.

OpenCL and Nvidia's GPU hardware
A contemporary high-end Nvidia GPU contains thousands of processing elements (CUDA cores) which are grouped into multiple computing units (streaming multiprocessors).GPU-side code execution begins when a special kind of subroutine called kernel is launched.A kernel is executed in parallel by a large set of work-items (threads) and each work-item is given a unique index number which makes branching possible.The work-items are grouped into work groups, which are then mapped to the computing units.A computing unit can execute multiple work groups concurrently but each work group is mapped to only a single computing unit.
The processing elements inside the same computing unit share certain resources such as a register file, schedulers, special function units, and caches.The schedulers can issue instructions from multiple independent instruction streams (work-items) to the same processing element.This type of over-saturated processing element occupancy is actually a desirable situation because then the instruction pipeline can be kept more easily occupied.Because multiple processing elements share the same scheduler, the code is actually executed in a synchronous manner.The hardware handles branches by going through all necessary execution paths temporally disabling those processing elements that do not contribute to the final result.Nvidia uses the term warp to describe a set of 32 work-items that are executed together in this synchronized manner.
OpenCL has two primary memory spaces: Global memory is a large but relatively low bandwidth (off-chip) memory space, which can be accessed by all work-items.Attainable memory bandwidth depends largely on GPU's microarchitecture and the used access pattern.In case of older devices, a warp-coalesced access pattern, that is, a access pattern where the warp accesses memory locations that fall within the same 128 byte memory block / L1 cache line, is usually the fastest way to access the global memory.Newer devices can also benefit from warp-coalesced access but the benefits are not as substantial because only the L2 cache (32 byte cache line) can be used in most situations.
Local memory is a fast (on-chip) memory space which can be used when multiple work-items that belong to the same work group want to share data.The memory is divided into 32-bit (or 64-bit) memory banks organized such a way that successive 32-bit (or 64-bit) words map to successive memory banks.Each memory bank is capable of serving only one concurrent memory request.This limits the number of effective access patterns as simultaneous memory requests, that point to the same memory bank, cause a memory bank conflict and are processed sequentially.
In addition, there are two additional memory spaces called constant memory and private memory, but they are not used (explicitly) in our GPU implementation.

General notes
The GPU implementation is based on the radix-4 PSCR method.The radix-4 variant was chosen because it is relatively simple to implement but still nearly optimal in the sense of arithmetical complexity [33].Based on the numerical results presented in [26], it is also likely that a radix-4 method will outperform a radix-2 method on a GPU, especially when the problem size is small.
Our GPU implementation can be applied to problems where the factor matrices A 1 , A 2 , M 1 , and M 2 in (7) are tridiagonal and symmetric.For threedimensional problem, the factor matrices A 3 and M 3 in (8) are also assumed to be tridiagonal and symmetric.As mentioned earlier, the factor matrix M 1 (and the factor matrix M 2 when dealing with three-dimensional problems) is assumed to be positive definite.The field K can be either R or C. The system size can be arbitrary as long as the GPU has enough global memory to accommodate the right-hand side vector, the factor matrices, the eigenvalues, the required eigenvector components, guiding information, and workspace buffers.Here, the term guiding information refers to data structures that are used to map work groups to different computational tasks inside the kernels.
The generalized eigenvalue problems (23) are solved on the CPU-side.This is not a major limitation because the PSCR method is usually used when one desires solve a large set of linear systems with (nearly) identical coefficient matrices.If the factor matrix M 1 is not diagonal, then the generalized eigenvalue problem is preprocessed with Crawford's algorithm [37] leading to an ordinary eigenvalue problem with the same eigenvalues.The eigenvalues are then solved using the LR-algorithm [38] coupled with Wilkinson's shift.The eigenvectors are solved using the inverse iteration method after the eigenvalues have been computed.
OpenCL does not have a native support for complex numbers.In the GPU implementation presented in this paper, a complex number is presented using a vector of length two.Multiplications and divisions are implemented using suitable precompiler conditionals/macros.This means that the same codebase can be used for real and complex valued problems.
Many aspects of the implementation are parametrized.For example, each kernel has its own parameter that specifies the size of the work group.If the system is complex valued, then the solver can be configured to store the real and imaginary parts separately.Similarly, the solver can be configured to store the two halves of a 64 bit double precision floating-point number as two separate 32 bit chunks.In addition, several parameters control the way the tridiagonal solver behaves.Optimal parameters that minimize the overall execution time are selected by solving the arising integer programming problems using a differential evolution [39,40] method.

Upper level implementation (levels 1 and 2)
The implementation of the PSCR method is divided into three levels: The level 3 is the tridiagonal solver, which is responsible for solving the tridiagonal subproblems in (26) and (28).The details are given in the next subsection.
The level 2 forms the right-hand side vectors for the aforementioned tridiagonal subproblems and computes the vectors (24) and (27).The level 1 is analogous to the level 2 as it performs the same operations to a three-dimensional problem.When the tridiagonal solver is excluded, the implementation consists of the following seven kernels: lx stage 11 forms the right-hand side vectors for the subproblems in (26).lx stage 12a computes the vector sums in (24).lx stage y2b helps to compute the vector sums in (24) and (27).lx stage 12c computes the vector f (i+1) in (24).lx stage 21 forms the right-hand side vectors for the subproblems in (28).lx stage 22a computes the vector sums in (27).lx stage 22c computes the vector u (i) in (27).
As the levels 1 and 2 are analogous to each others, they use the same codebase.Level 1 kernels have a prefix l1 and level 2 kernels have a prefix l2.
The purpose of the kernel lx stage y2b is to distribute the task of computing the vector sums in (24) and (27).Each vector sum is divided into multiple partial sums which are then evaluated in parallel by the kernels lx stage 12a and lx stage 22a.The same division into partial sums is then repeated recursively by the kernel lx stage y2b using a tree-like reduction pattern.The size of each partial sum is set by a parameter.

Tridiagonal solver (level 3)
The tridiagonal solver used in our GPU implementation is a generalized and extended version of the CR based tridiagonal solver used in our previous paper [26].The extended tridiagonal solver is based on the CR, PCR, and Thomas [41] methods.The CR method has been shown to be reasonably effective on GPUs due to its simplicity and parallel nature (see, e.g., [19,20,21,22,23,24,25]).In its basic formulation, the CR method suffers from many well-known drawbacks.The two most significant ones of them, when GPUs are concerned, are: • The memory access pattern disperses exponentially as the method progresses.This causes problems with the global memory because warpcoalesced access is no longer possible and it causes problems with the local memory because work-items end up accessing data that is stored in the same memory bank.These problems are tackled in this paper by using different permutation schemes.
• The number of parallel tasks is reduced by a factor of two on each reduction step which leads to low processing element occupancy.This further leads to suboptimal floating point and memory performance.The situation is even worse in the complex valued case as the memory requirement is doubled when compared to the real valued case but the number of parallel tasks stays the same.
The PCR method uses the same reduction formulas as the CR method but the reduction operation is applied to every row in each reduction step.Thus, all sub-systems are the same size and the arithmetical complexity of the method is O(m log m).The benefits of using the PCR method are: a more GPU-friendly memory access pattern, which does not cause additional memory bank conflicts, and high processing element occupancy.The method is widely used for solving tridiagonal systems on GPUs; see, e.g.[19,25].And finally, the Thomas method is a special variant of the well-known LUdecomposition method for tridiagonal matrices.The arithmetical complexity of the method is O(m) and it is the most effective of all the three mentioned methods.However, this sequential algorithm is not suitable for GPUs on its own.Thus, the method is often combined with the PCR method (see, e.g., [19,22]).This is possible because a PCR reduction step splits a linear system into two independent sub-systems.After a few PCR reduction steps, the remaining subsystems can be solved effectively using the Thomas method.The tridiagonal solver that is used in our PSCR implementation has four main stages: Stage A is used when the reduced system does not fit into the allocated local memory buffer.It this case, the coefficient matrix and the right-hand side vector are stored in the global memory and the local memory is used to share odd numbered rows between work-items.Thus each odd-numbered row needs to be read only once from the global memory.The system is divided into sections which are then processed independently of each other.The size of each section is two times the size of the work group.
After a new sub-system has been computed using the CR method, each section is permuted in such a way that all odd numbered rows from the previous sub-system are stored in the upper half of the section and all rows from the new sub-system are stored in the lower half of the section.This segmentation and permutation operation is repeated after each reduction step.Figure 2 shows how the reduction stage proceeds.
The purpose of this permutation scheme is to store those rows that are needed during the next reduction step continuously in the global memory, thus avoiding the exponential memory access pattern dispersion problem mentioned earlier.Each reduction and back substitution step can be performed using multiple work groups, thus allowing a larger number of work-items to be employed per tridiagonal system.
Stage B is used when the reduced system fits into the allocated local memory buffer but the number of remaining even numbered rows is higher than the size of the used work group.Figure 2 shows how the reduction stage proceeds.We refer reader to [26] ("middle stage") for further details.The simplified implementation presented in [26] benefited this additional step but the benefits did not translate to the generalized implementation.The parameter optimizer kept this stage disabled during most test runs.Stage C is used when the reduced system fits into the allocated local memory buffer and the number of remaining even numbered rows is at most the same as the size of the used work group.Before the first reduction step, the system is permuted in such a way that all odd numbered rows are stored in the upper half of the vector and all even numbered rows are stored in the lower half of the vector.After the new sub-system has been computed using the CR method, each vector is permuted in such a way that all computed rows, that are going to be even numbered rows during the next reduction step, are stored in the first part of the vector, followed by all computed rows, that are going to be odd numbered rows during the next reduction step.This division and permutation operation is repeated after each reduction step.This permutation pattern seems to be identical with the one used in [20].
Stage D solves the remaining system using a PCR-Thomas hybrid method similar to [19,22].This stage can be disabled completely or the Thomasstep can be skipped, if desired.In our implementation this is decided by the differential evolution optimization of the execution time.
The permutations can be implemented effectively because the coefficient matrices are stored in a vector format and the reversal of the permutation pattern during the back substitution stage is necessary only in the case of the right-hand side vectors.The tridiagonal solver presented in [26] consisted of simplified versions of the stages A, B, and C. In total, the tridiagonal solver consists of five kernels.If it is necessary to use the global memory, then the following four kernels are used: l3 gen glo sys is responsible for forming the coefficient matrices into the global memory.If there is no need to use the global memory, then only one kernel (l3 bcd gen sys) is used to performs all necessary operations.The work group sizes, the amount of allocated local memory, switching points between different stages, the number of work groups per system, and many other properties are parametrized and, thus, optimized.

Test problems
The first set of test problems is made out of two-dimensional and threedimensional Poisson equations with Dirichlet boundary conditions: In a rectangle, the discretization using piecewise linear finite-elements on an orthogonal finite-element mesh leads the following factor matrices: where j = 1, 2 or j = 1, 2, 3. Above, h j,l is the lth mesh step in the jth coordinate axis direction.In addition, c = 0 in (7) or (8).
The second set of test problems is made out of approximations of the Helmholtz equation where d = 2, 3 and ω is the wave number.The second equation of ( 32) is the Sommerfeld radiation condition which poses u to be a radiating solution.Here i denotes the imaginary unit.The unbounded domain R d must be truncated to a finite one before finite element solution can be attempted.This means that we must approximate the radiation condition at the truncation boundary.
385 Two popular ways of achieving this are a PML [4,5] and ABC [1,2,3].If the truncated domain is rectangular and the problem is discretized using bilinear (or trilinear) finite elements on an orthogonal finite-element mesh, then the resulting coefficient matrices can be presented using the Kronecker matrix tensor product in a form which is suitable for the PSCR method (see, e.g., [6]).

390
A second-order ABC was chosen for numerical experiments.For the twodimensional Helmholtz equation, c = −ω 2 in (7) and the factor matrices are defined as and where , and c j,l = h j,l 6 . (35)

Comparisons and general analysis of the results
GPU experiments were carried out on a Nvidia GTX 1080 Ti GPU and an Intel i5-6600 CPU was used in CPU experiments.A single-threaded variant of the radix-4 PSCR CPU code presented in [33] was used as a CPU reference implementation.All the experiments were performed using double precision floating point arithmetic.Due to certain limitations in Nvidia's current OpenCL drivers, the GPU implementation could access only up to 4GB of global memory which limited the sizes of the numerical problems that could be considered.The GPU implementation can solve slightly larger systems than presented here but the parameter optimizer does not work properly when the required amount of global memory is close to 4GB.This is likely a memory fragmentation issue.Tables 1 and 2 show the results obtained with the two-dimensional test problems.The initialization times are excluded from the tabulated execution times.The initialization time depends on the properties of the generalized eigenvalue problems (23) but in general the initialization time is one order of magnitude longer than the actual solution time.Figure 4 shows the observed speedups when the Nvidia GTX 1080 Ti GPU was used instead of one core of the Intel i5-6600 CPU.The right-hand side vector transfer times are excluded from the speedup calculations.For the Poisson equations, the GPU is up to 16 times faster and for the Helmholtz equations, the GPU is up to 20 times faster.The size of the problem determines the number of work-items that can be used which further determines how effectively the processing elements can be utilized.In addition, each kernel launch causes some additional overhead which has the largest impact on small problems as shown in Figure 6.These two observations explains why the speedup goes up as the problem size increases.The 415 fact that the observed speedup flattens in the case the large Poisson equations will be discussed in Subsection 5.3.Tables 3 and 4 show the corresponding results for the three-dimensional test problems.The initialization times are again excluded from the tabulated execution times.However, unlike in the case of the two-dimensional test problems, 420 the initialization time is generally negligible when compared to the solution time.The speedup graph in Figure 5 shows that the GPU is up to 24 times faster at solving the Poisson equations and up to 18 times faster at solving the

Helmholtz equations.
Figures 6 and 7 show how the execution time is distributed among different kernels on the Nvidia GTX 1080 Ti GPU.For the small two-dimensional problems, the execution time is dominated by the overhead, as expected.For the larger two-dimensional problems, the execution time is dominated by the tridiagonal solver (l3-kernels).Therefore, the main efforts to improve the implementation should be directed toward the tridiagonal solver.Some effort should also be directed towards the kernel lx stage 21.The transition point where the tridiagonal solver begins to use the global memory (stage A) can be seen clearly at n 1 = 1023.For the three-dimensional problems, the overhead is negletable and the execution time is again dominated by the tridiagonal solver.
Based on numerical experiments, the GPU implementation appears to be as numerically stable as the reference CPU implementation.The parameter optimization provided some additional benefit when compared to non-optimized   generic parameters which try to maximize processing element utilization by using large work groups and utilizing the local memory as much as possible.In case of the largest two-dimensional problems the parameter optimization lead 440 up to 94% improvement.In other cases the improvements were more modest; average improvement being about 11%.

Roofline models
The roofline model [36] is a performance analysis tool which takes into account the available off-chip memory bandwidth.The model gives a much more 445 accurate picture than what can be obtained by simply counting the floating point operations (flop).This is especially true in the case of GPUs since the theoretical floating point performance can be remarkably high but the actual attainable floating point performance is limited in several ways, including the global memory bandwidth.The roofline model has been previously successfully applied to GPUs; see, e.g., [42].
Basically, the application of the roofline model produces a two-dimensional scatter plot, assigning "operational intensity" to the horizontal axis, and "attained floating point performance" to the vertical axis.Here the operational intensity is defined as operational intensity = number of floating point operations number of bytes of (off-chip) memory traffic .
The model includes two device specific upper limits for the attainable floating point performance.The first upper limit is the theoretical peak floating point performance of the device.The second upper limit is determined by the peak (off-chip) memory bandwidth and calculated by peak (off-chip) memory bandwidth × operational intensity.
The point where these two upper bounds intersect is called device specific balance point.If the operational intensity of an algorithm is lower than the device specific balance point, then the algorithm is considered to be memory-bound; otherwise the algorithm is considered to be compute-bound.The actual attained floating point performance should be close to the upper limit specified by the operational intensity of the algorithm.The analysis presented in this paper is based on computing an estimate for the number of floating point and memory operations on test run by test run basis.These estimates take into account the problem size and the parameters.The test run specific operational intensity and attained floating point performance were then derived from these estimates.
Figure 8 shows global memory roofline models for the Nvidia GTX 1080 Ti GPU.Based on the information provided by Nvidia, the theoretical peak double precision floating point performance was estimated to be about 332 GFlop/s (368 GFlop/s boosted) and the theoretical peak global memory bandwidth to be about 484 GB/s.This means that the device specific balance point is about 0.69 Flop/Byte, or alternatively, 5.49 flops per double precision number.
For the two-dimensional problems, the measurement points form two distinct clusters.The first cluster includes the cases in which the tridiagonal solver uses the global memory (stage A) and the second cluster includes the cases where the global memory is not used.For the two-dimensional Poisson equations, the clusters are located near (0.28 Flop/Byte, 136 GFlop/s) and (1.03 Flop/Byte, 148 GFlop/s).For the two-dimensional Helmholtz equations, the clusters are located near (0.41 Flop/Byte, 173 GFlop/s) and (1.13 Flop/Byte, 139 GFlop/s).
These roofline models indicate that the implementation is memory-bound for large two-dimensional problems.More careful look into the data reveals that the implementation becomes memory-bound when the tridiagonal solver switches to using the stage A. The models also explain the observed flattening in the speedup shown in Figure 4.The model predicts that obtainable floating point performance for the first cluster of two-dimensional Poisson equations is only 484 GB/s × 0.28 Flop/Byte = 136 GFlop/s, which matches the observed performance perfectly.Thus, as the second cluster reached 148 GFlop/s performance, the obtained floating point performance is expected to drop.In other cases the models do not completely explain the observed results as the obtained floating-point performance is slightly lower that what is predicted by the models.Several explanations were considered ranging from memory bank conflicts to potential GPU driver problems and limitations.In the end, we came to the conclusion that the most important issue to consider in this regard is the limited amount of local memory available in contemporary GPUs, which has the unfortunate side effect of severely restricting the number of work groups that can be executed simultaneously in a computing unit.This is likely to lead to a low processing element occupancy.
It appears that the CR-PCR-Thomas hybrid tridiagonal solver is unable to maintain the required processing element occupancy even though it is possible in principle by utilizing the PCR-Thomas hybrid method sufficiently.This is due to the fact that the increase in arithmetical complexity would negate any gains made.Actually, there was little discernible benefit from using the Thomas stage at all and in most cases only the CR and PCR stages were used.

Conclusions
This paper presented a generalized GPU implementation of the radix-4 PSCR method and numerical results obtained with the implementation for four test problems.The results indicate up to 24-fold speedups when compared to an equivalent CPU implementation utilizing a single CPU core.The highest speedups were obtained with the three-dimensional Poisson equations.The numerical results indicate that the presented GPU implementation is effecient and that GPUs provide demonstrable benefit in the context of the cyclic reduction and MDA methods.The presented roofline models indicate that the performance is, for the most part, limited by the available global memory bandwidth and the effectiveness of the tridiagonal solver used to solve the arising tridiagonal subproblems.The differential evolution parameter optimization improved the average performance by 11%.

Figure 1 :
Figure1: An example of the sets J 0 , J 1 , J 2 , J 3 when q = 4 and n 1 = 32.Indexes that belong to each set are highlighted in black.The set J i implies that block rows, that correspond to indexes that belong to the set J i−1 \ J i , are eliminated from the system during the ith reduction step.

Figure 2 :
Figure 2: Examples of the permutation patterns during the stage A (on the left) and stage B (on the right) of the tridiagonal solver.In this illustration, each work group contains four work-items.In practice, the work group size is always some multiple of the warp size.

Figure 3 :
Figure 3: An example of the permutation pattern during the stage C of the tridiagonal solver.In this illustration, each work group contains eight work-items.
l3 a1 performs one stage A CR reduction step in global memory.If only one work group is used per system, then the kernel performs all remaining stage A reduction steps.l3 a2 performs one stage A CR back substitution step in global memory.If only one work group is used per system, then the kernel mirrors the behavior of the kernel l3 a1.l3 bcd cpy sys copies the system into the local memory and performs the remaining stages B, C and D.

Figure 4 :
Figure 4: A execution time comparison between the Intel i5-6600 CPU and the Nvidia GTX 1080 Ti GPU for the two-dimensional problems.

Figure 5 :
Figure 5: A execution time comparison between the Intel i5-6600 CPU and the Nvidia GTX 1080 Ti GPU for the three-dimensional test problems.

Figure 6 :
Figure 6: Execution time distribution for the two-dimensional Helmholtz equations on the Nvidia GTX 1080 Ti GPU.Miscellaneous includes all remaining kernels.

Figure 7 :
Figure 7: Execution time distribution for the three-dimensional Helmholtz equations on the Nvidia GTX 1080 Ti GPU.Miscellaneous includes all remaining kernels.

Figure 8 :
Figure 8: Roofline models for the Nvidia GTX 1080 Ti GPU.The horizontal axis shows the operational intensity [Flop/Byte] and the vertical shows the attained floating point performance [GFlop/s].

Table 1 :
Execution and RAM-VRAM-RAM transfer times (in seconds) for the twodimensional Poisson equations (n 1 n 2 degrees of freedom).

Table 2 :
Execution and RAM-VRAM-RAM transfer times (in seconds) for the twodimensional Helmholtz equations (n 1 n 2 degrees of freedom).

Table 3 :
Execution and RAM-VRAM-RAM transfer times (in seconds) for the threedimensional Poisson equations (n 1 n 2 n 3 degrees of freedom).

Table 4 :
n 1 ,n 2 , n 3 Execution and RAM-VRAM-RAM transfer times (in seconds) for the threedimensional Helmholtz equations (n 1 n 2 n 3 degrees of freedom).