Memory access pattern matters

Posted by Filip Ekberg on July 1 2013 6 Comments

ignorance-is-blissLast week on BUILD there was a talk on Native Code Performance and Memory: The Elephant in the CPU, even though I didn’t attend BUILD this year I watched it on Channel9 and I’d like to recap something from it that I find interesting. In many cases I think that we ignore the fact that we need to handle performance optimizations ourselves, we can’t hide behind the CLR and the compiler all the time now, can we? Some say that Ignorance is Bliss, but in this case Ignorance might cause a lot of headache.

Let’s start by taking an example from Eric Brumer’s talk. Consider that we have the following three multi-dimensional arrays:

int N = 1000;
var A = new int[N, N];
var B = new int[N, N];
var C = new int[N, N];

Each matrix can be visualized like this, but in our case let’s use random numbers later on. So as you can see it grows to the size of N in both X and Y axis.

Matrix

We can use the following code to fill these with random numbers:

var random = new Random();
for(int x = 0; x < N; x++)
{
    for(int y = 0; y < N; y++)
    {
        A[x, y] = random.Next();
        B[x, y] = random.Next();
        C[x, y] = random.Next();
    }
}

Let’s now multiply A with B and store the result in C. This is where it all gets interesting. Let us first take a look at the code to perform this multiplication:

for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i,j] += A[i,k] * B[k, j];

Running this on my machine takes about 12 seconds. We can do much better than that though! So what can we do in order to speed up the performance?

We can look at the memory access!

To some of you it might be obvious that when you access elements in memory that are closer to each other, this is faster than when accessing random memory over and over again. An array, or a row in a multi-dimensional array are stored in a sequence in memory. Here’s an example of that:

MemoryAlignment

Let’s focus on B[k, j]. The k in this case tells us which row to load data from and the j which column. This means that when we are in the first iteration we will load the first value from the first column and the second time we will load the first value from the second row like this:

MemoryAlignment-2

This is really bad because when we ask for the data in [0, 0] we will also have the surrounding data loaded into the cache. This means that the second time we iterate we will have to load a completely new chunk of data into the cache/memory because we don’t have access to the row that we want.

So how do we fix this?

It’s quite simple and probably obvious. We can just swap the two inner loops so that we instead process each column in each row before we change row:

for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            C[i, j] += A[i, k] * B[k, j];

This actually resulted in 60% faster execution on my machine and in low level programming (C/C++) I imagine the impact would be much larger. The reason that I wanted to bring all this up is because it’s important that we don’t forget about these things, even if we live in a managed world. I also strongly advice to go over to Channel9 and watch that presentation by Eric Brumer.

The samples in this post are the same that Eric brings up in his presentation among a lot of other great examples of how we should work with our code memory efficiently. Read more about Optimize Data Structures and Memory Access Patterns to Improve Data Locality.

Is this obvious? Do you consider these things in your day-to-day work? Leave your thoughts in the comment field below!

Here’s a complete code sample that I used to measure the times:

int N = 1000;
var A = new int[N,N];
var B = new int[N, N];
var C = new int[N, N];
var random = new Random();
for(int x = 0; x < N; x++)
{
    for(int y = 0; y < N; y++)
    {
        A[x, y] = random.Next();
        B[x, y] = random.Next();
        C[x, y] = random.Next();
    }
}
 
var watch = new Stopwatch();
watch.Start();
 
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i,j] += A[i,k] * B[k, j];
 
watch.Stop();
 
Console.WriteLine("Time: {0}ms", watch.ElapsedMilliseconds);
 
 
watch.Reset();
watch.Start();
 
// 60% Faster than the loop above!
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            C[i, j] += A[i, k] * B[k, j];
 
watch.Stop();
 
Console.WriteLine("Time: {0}ms", watch.ElapsedMilliseconds);

Vote on HN

6 Responses to Memory access pattern matters

  1. gmarkovNo Gravatar says:

    Are there too many watch.Starts ?

  2. Filip EkbergNo Gravatar says:

    Thanks. I’ve fixed that. Fortunately I still get the same results!

  3. Chris LomontNo Gravatar says:

    You can make it again much faster by treating the large matrices as smaller (cache friendly) ones, and doing those instead of the big one.

  4. Omkar SinghNo Gravatar says:

    Its really great. After reading your post i just realized how often we ignore the basis.
    Thanks for bringing it up.

  5. Pingback: Dew Drop – July 2, 2013 (#1,577) | Alvin Ashcraft's Morning Dew

  6. Kasper HjNo Gravatar says:

    Thumbs up for doing the actual performance measurement! I believe that every programmer should know of caching and branch prediction even is programming in a high(er)-level language. For fun, I tested your code with A, B and C as arrays of references (some class I created and initialized an integer = random), and the performance gain was ~400%. It goes to show, that if you know about the low-level stuff, you become a better — and more complete — programmer.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>