Restructuring and Implementations of 2D Matrix Transpose Algorithm Using SSE4 Vector Instructions
Abstract
Current general-purpose processors are augmented with vector instructions that can
process many elements of matrices and vectors in parallel. Transposing a matrix in place
is a key kernel operation required by many scientific and engineering applications to
shuttle data before, during, or after processing. This operation increases the traffic on
the memory bus, and hence techniques such as blocking are required to improve
performance. In this paper, we present an enhanced version of a previously published
algorithm for transposing a matrix on a two-dimensional processor array. We restructured
this algorithm to fit the one-dimensional vector register architecture added to
general-purpose CPUs. We implemented the new vector algorithm using the Intel SSE4 vector
instruction set and compared its performance with the standard sequential algorithm and
with an existing implementation of Eklundh’s algorithm. We also studied automatic
compiler optimizations and their effect on the vectorization of the algorithm. The best
of our implementations showed a maximum speedup of 1.6 over the sequential algorithm and
performance almost equal to that of the Eklundh’s algorithm implementation.
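As a rough illustration of the blocking idea mentioned above, the following C sketch contrasts a plain sequential in-place transpose with a tiled one. It is not the vector algorithm of this paper and uses no SSE4 instructions. The function names `transpose_naive` and `transpose_blocked`, the row-major square `float` layout, and the `block` parameter are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline (hypothetical name): sequential in-place transpose of an
 * n x n row-major matrix. Each pair above the diagonal is swapped once. */
static void transpose_naive(float *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = i + 1; j < n; ++j) {
            float t = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = t;
        }
    }
}

/* Blocked variant (hypothetical name): visit the upper triangle in
 * block x block tiles so that both the tile and its mirror image stay
 * in cache while they are being swapped. */
static void transpose_blocked(float *a, size_t n, size_t block)
{
    assert(block > 0);
    for (size_t bi = 0; bi < n; bi += block) {
        for (size_t bj = bi; bj < n; bj += block) {
            size_t i_end = bi + block < n ? bi + block : n;
            size_t j_end = bj + block < n ? bj + block : n;
            for (size_t i = bi; i < i_end; ++i) {
                /* On diagonal tiles, start past the diagonal so that
                 * no pair is swapped twice. */
                size_t j0 = (bi == bj) ? i + 1 : bj;
                for (size_t j = j0; j < j_end; ++j) {
                    float t = a[i * n + j];
                    a[i * n + j] = a[j * n + i];
                    a[j * n + i] = t;
                }
            }
        }
    }
}
```

Both functions produce the same result; the tiled version only changes the order in which element pairs are visited. A suitable tile size depends on the cache hierarchy of the target machine and would have to be found by measurement.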
Journal/Conference Information
International Conference on Applied Research in Computer Science and Engineering,