Matrix Multiplication: Optimizing the code from 6 hours to 1 sec

Photo by Luan Gjokaj on Unsplash

We millenials have been blessed by so many things and unlimited RAM and Computation is one of them!

Professor Charles Leiserson says that Performance Engineering is a lost art. Having never thought about squeezing performance from the system, I agree with him.

Introduction and Hardware Specs

In this write-up, I will share my story in which I decided to implement efficient Matrix Multiplication.

I did my experiments on aws c4.8xlarge. It has 36 vCPUs and 60 GiB of Memory. I know this looks quite heavy machine to do experiments, but you will get to know soon why I chose this machine. The ec2 instance also has 2.9 GHz Intel Xeon E5–2666 v3 Processor and Intel AVX†, Intel AVX2†, Intel Turbo.

In case you are wondering, Intel AVX1/2 is 256/512-bit instruction set extensions designed for applications that are Floating Point (FP) intensive such as Image processing, Financial data analytics etc.

reference github repo —

Defining a baseline

We start the experiment by implementing a matrix multiplication on two 4096x4096 sized matrix in python.

Python implementation can be found here.

In this baseline code, I simply looped over all the elements —

for i in tqdm(range(n)):    
for j in range(n):
for k in range(n):
C[i][j] += A[i][k] * B[k][j]

In order to complete the processing, this code took around 6 hours!

Screenshot of htop

We can see that only 1 core was being used by python3. Now, can we do better?

Rewriting in cpp

I wrote the exact code in cpp, you can find the implementation here.

We found out it took 1363.172706 sec, that is 22 mins!

However, this also uses only 1 core.

The following can be the reason why we got 16x times improvement from our baseline —

  • cpp is compiled whereas python is interpreted

Valgrind magic

We used the cachegrind tool available in valgrind and found out that the cache miss rate was really high!

valgrind --tools=cachegrind ./m.out

We did a little bit of research here and found out that if we change the loop order it keeps the correctness while the speed increase because cache hit increases.

for(int i = 0; i < n; ++i) {        
for(int k = 0; k < n; ++k) {
for(int j = 0; j < n; ++j) {
C[i][j] += A[i][k] * B[k][j];

Note — we changed the order of j and k!

We then measured the performance of this matrix multiplication. The code took 254 sec, i.e 4 mins!

This is 90x improvement from the baseline model!

Some Compiler Help

There are some optimization levels which you can pass while compiling your code.

You can read about it more here —

We now compiled the code with O3 flag.

g++ -O3 code.cpp -o code.out

The code took 46 seconds to multiply two 4096x4096 sized matrices.

This is around 500x times improvement from the baseline model!

SIMD — Let’s Parallelize

Okay, so far we had been using single cpu of a 36 cpu giant!

In order to parallelize the matrix multiplication, we used openmp.

You can find the complete code here.

#pragma omp parallel for private(i,j,k) shared(A,B,C)    
for(i = 0; i < n; ++i) {
for(int k = 0; k < n; ++k) {
for(j = 0; j < n; ++j) {
C[i][j] += A[i][k] * B[k][j];

Now all the cores are being used!

This code when compiled with all the above steps mentioned, took 4 seconds!

This is around 6000x times improvement from the baseline code!

Read CLRS Algorithms book: Changing the algorithm

We found out that there is an efficient way to do multiplication of matrix. The code can be found here.

Strassen’s Algorithm — It is an divide and conquer way to do matrix multiplication.

We also changed the loop order in this algorithm and compiled -03.

This code took around 2.5 secs.

This is around 9000x times improvement from baseline.

Adding some Intel architecture support

So far so good, But can we touch 1 sec?

We know that our CPU architecture supports AVX2. This is Intel’s instruction set to help in vector math.

g++ -O3 -march=native -ffast-math matrix_strassen_omp.cpp -fopenmp -o matr_satrassen

This code took 1.3 secs to finish matrix multiplication of two 4096x4096 sized matrices.

A 17000x times improvement from the baseline! Is that how success tastes like? :P

Photo by bruce mars on Unsplash


This experiment was for my learning. I think this blog post was able to convey various ways of doing optimization of your code.

Also, this is certainly not the best matrix multiplication implementation. Numpy takes 0.6 secs multiply two 4096x4096 matrices.

Read more on Optimization and Performance Engineering —

I love stuff about data. Data Engineering and Data Science | Computer Vision | Mathematics | Contact me — vaibhaw[dot]vipul[at]gmail[dot]com