Lessons From Parallelizing Matrix Multiplication

Before you dismiss this as yet another boring matrix multiplication example, consider that a lot of tutorials on the web use a naive version of matrix multiplication and then attempt to parallelize that version to demonstrate how to speed things up with some technology.

What do I mean by a naive version? (There are many naive ways of doing things, depending on the system we are talking about.) This is the basic C++ code for matrix multiplication used in most examples:

void matmult(int m, int n, int p, double **A, double **B, double **C) {
  // C (m x n) += A (m x p) * B (p x n), in i-j-k order
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      for (int k = 0; k < p; k++) {
        // B[k][j] with k varying: the inner loop strides down a column of B
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

And this is the smarter version, which shouldn't surprise anyone who has taken a computer architecture course:

void matmult(int m, int n, int p, double **A, double **B, double **C) {
  // Same computation, but in i-k-j order
  for (int i = 0; i < m; i++) {
    for (int k = 0; k < p; k++) {
      for (int j = 0; j < n; j++) {
        // B[k][j] and C[i][j] with j varying: unit-stride access along rows
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

What's the difference? It's pretty subtle, so you might have missed it: the two inner loops, over j and k, have been swapped. With the i-j-k order, the innermost loop strides down a column of B (B[k][j] with k varying), so it touches a different cache line on nearly every iteration. With the i-k-j order, the innermost loop walks along a row of B and a row of C, so consecutive accesses fall on the same cache lines.

What can changing two lines do? After all, the results are the same, right? Well, take a look at this output from the Shark profiler tool on OS X.

[Figure: Profiling matrix multiplication in Shark — L2 cache misses for both versions, measured on OS X 10.5]

The version on the left uses the naive algorithm and has 3 484 436 L2 cache misses for a 1024 x 1024 matrix multiplication. The version on the right uses the smarter algorithm and has 27 176 L2 cache misses. On my Core Duo MacBook Pro, the naive version runs in 52 133 ms and the smarter version runs in 13 730 ms. That's nearly a 4x speedup, and we haven't even parallelized anything yet!

Hopefully that last line caught your attention. That's the result on my MacBook Pro with a fairly old two-core processor. Let's run some experiments on a 24-core machine (four sockets, six cores per socket) with Intel Xeon E7450 CPUs @ 2.40 GHz, running Linux 2.6.18. The code was compiled with g++-4.2 using -O3.
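If you want to reproduce the comparison, a harness along the following lines is enough. This is a minimal sketch, not the code used for the numbers above: the allocation scheme, fill values, and use of std::chrono are my own choices, and std::chrono needs a C++11 compiler (the original g++-4.2 build would need gettimeofday or similar instead).

#include <chrono>
#include <iostream>

void matmult(int m, int n, int p, double **A, double **B, double **C);

// Allocate an r x c matrix as an array of row pointers and fill it with `value`.
double **alloc_matrix(int r, int c, double value) {
  double **M = new double *[r];
  for (int i = 0; i < r; i++) {
    M[i] = new double[c];
    for (int j = 0; j < c; j++) M[i][j] = value;
  }
  return M;
}

int main() {
  const int N = 1024;  // same size as the experiments above
  double **A = alloc_matrix(N, N, 1.0);
  double **B = alloc_matrix(N, N, 2.0);
  double **C = alloc_matrix(N, N, 0.0);

  auto start = std::chrono::steady_clock::now();
  matmult(N, N, N, A, B, C);
  auto stop = std::chrono::steady_clock::now();

  std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
            << " ms" << std::endl;
  // (deallocation omitted for brevity)
  return 0;
}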

The naive version runs in 15 962 ms. The smarter version runs in 1 860 ms. Let's try parallelizing the naive version with OpenMP:

void matmult(int m, int n, int p, double **A, double **B, double **C) {
// Requires compiling with -fopenmp; each thread works on its own rows of C,
// so there are no write conflicts.
#pragma omp parallel for
  for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
      for (int k = 0; k < p; k++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

[Figure: Execution time of the parallelized naive matrix multiplication vs. number of threads]

Notice that we need ~9 threads just to break even with the single-threaded smarter version. Incidentally, if we parallelize the smarter algorithm with 24 threads, it runs in 151 ms!
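For completeness, here is what that combination looks like: the i-k-j ordering with the same outer-loop pragma applied. This is a sketch of the idea described above, not necessarily the exact code behind the 151 ms number.

void matmult(int m, int n, int p, double **A, double **B, double **C) {
// Parallelize over rows of C; the i-k-j order keeps the inner loop cache friendly.
#pragma omp parallel for
  for (int i = 0; i < m; i++) {
    for (int k = 0; k < p; k++) {
      for (int j = 0; j < n; j++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}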

So what are the lessons from this?

