GLU – GPU-Accelerated Sparse Parallel LU Factorization Solver
Version 2.0 (with enhanced numerical stability)
Dr. Sheldon Tan (PI)
Department of Electrical Engineering,
University of California – Riverside
· National Science Foundation, “SHF: Small: GPU-Based Many-Core Parallel Simulation of Interconnect and High-Frequency Circuits”, (CCF-1017090), 9/1/2010-8/30/2014, PI: Sheldon Tan
We thank National Science Foundations for supports of this research work. Any opinions, findings, and conclusions or recommendations expressed in those materials below are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Transforming a sparse matrix into its LU form is of crucial importance in linear algebra as it plays an important role in many numerical and scientific computing applications such as finite difference and finite element based methods. LU factorization operation represents the dominant computing cost in those problems and it is very important to improve the efficiency of the LU factorization algorithms. LU factorization for sparse matrices is the most important computing step for general circuit simulation problems for circuit designs. But parallelizing LU factorization on the popular many-core platforms such as Graphic Processing Units (GPU) turns out to be a difficult problem due to intrinsic data dependency and irregular memory access, which diminish GPU computing power.
Modern computer architecture has shifted towards the multi-core processor and many-core architectures. The family of GPU is among the most powerful many-core computing systems in mass-market use. For instance, the state-of-the-art NVIDIA Kepler K40 GPU with 2880 cores has a peak performance of over 4~TFLOPS versus about 80--100~GFLOPS of Intel i7 series Quad-core CPUs. In addition to the primary use of GPUs in accelerating graphics rendering operations, there has been considerable interest in exploiting GPUs for general purpose computation (GPGPU).
GLU was GPU-accelerated parallel sparse LU factorization solver developed by VSCLAB at UC Riverside. GLU is based on a hybrid right-looking LU factorization algorithm. We show that more concurrency can be exploited in the right-looking method than the left-looking method, especially on GPU platforms.
GLU has the following features and advantages over existing sparse LU solvers:
· It is base on the hybrid column-based right-looking LU factorization method, which is shown to be more amenable for exploiting the concurrency of LU factorization. The new method preserves the benefit of column-level concurrency and symbolic analysis in the left-looking method meanwhile, it allows more parallelism to be exploited in GPUs.
· GLU solver allows the parallelization of all three loops in the LU factorization on GPUs. In contrast, the existing GPU-based left- looking LU factorization approach can only allow two-level parallelization.
We conduct comprehensive studies on the new GPU LU solver on a number of published matrices, circuit matrices, self-made large circuit matrices against some existing LU solvers such as PARDISO, KLU, UMFPACK to demonstrate the advantage of the proposed parallel LU solver. In the tables, GPU-LL is a recently published GPU-accelerated left-looking sparse LU solver. Following is the performance comparison with existing LU sparse solver.
Table I lists the benchmark matrices we used. The matrix benchmark matrices come from University of Florida Sparse Matrix Collection (UFL). Table II gives the comparison results. The GLU was implemented in CUDA 5.0 and the experimental results are carried out in a Linux server with two 8-Core Xeon E5-2670 CPUs, DDR3-1600 64GB memory. The server also consists of one K40c GPU and one K20 GPU, which serve as the GPU platforms for the proposed algorithms. Note that all the GPU results are obtained from the K40c GPU.
Numerical results show that the proposed GLU solver can deliver 5.71X and 1.46X speedup over single-threaded and the 16-threaded PARDISO solver respectively, 19.56X speedup over the KLU solver, 47.13X over the UMFPACK solver and 1.47X speedup over a recently proposed GPU-based left-looking LU solver on the set of typical circuit matrices from University of Florida Sparse Matrix Collection (UFL). Furthermore, we also compared the proposed GLU solver on a set of general matrices from UFL, GLU achieves 6.38X and 1.12X speedup over the single-threaded and the 16-threaded PARDISO solver respectively, 39.39X speedup over the KLU solver, 24.04X over the UMFPACK solver and 2.35X speedup over the same GPU-based left-looking LU solver. Also comparison on self-generated RLC mesh networks shows a similar trend, which further validates the advantage of the proposed method over the existing sparse LU solvers.
Recently, we have further improved the GLU solver and release the GLU 2.0. The new GLU 2.0 has the following new features. (1) we add the HSL MC64 algorithm as part of the pre-process to make the diagonal element dominants to improve the numerical stability of the pivoting process. (2) we rewrote the parallel level detection methods to ensure all the dependency of columns of given matrices in the left-looking LU method. (3) we applied a forced perturbation technique to remove make numerical diagonal zeros (by adding small diagonal values), which has been used in other commercial LU factorization. The new GLU 2.0 manual and guide can be found here.
GLU solver version 2.0 (with improved numerical stability), which includes all the sources codes, user manual and some examples, can be found here.
Please send any problem, bug and comment regarding GLU 2.0 to Prof. Sheldon Tan at email@example.com.
J1. K. He, S. X.-D. Tan, H. Wang and G. Shi, “GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis”, IEEE Transactions on Very Large Scale Integrated Systems (TVLSI), Vol. 24, No. 3, pp. 1140-1150, 2016.