Figure 2: Transposition of the first half of an 8×8 byte block. Each row is displayed as a packed 8-byte number. Since only one unpack operation per cycle can be executed, they were intermixed with other instructions to improve parallelism. Matrix transposition cannot be done in place unless the matrix is square.
Copyright © 1999, Dr. Dobb's Journal