
Wednesday, May 29, 2013

What is vectorization, or how to improve your program's performance?

What is vectorization ?

Vectorization is a very interesting feature offered by the CPU (remember the MMX instructions more than 10 years ago). Honestly, trying to optimize a program without using vectorization is like having a good meal without a nice bottle of wine! That's crazy!
To understand vectorization, we must explain the CPU architecture very simply. A CPU is an electronic device able to:

  • Retrieve data from the memory (load instruction)
  • Save data in the memory (store instruction)
  • Execute operations (add, mul, div, sub, etc.)
We could represent a very basic CPU with this picture:

But in reality, it's a little more complicated. Actually, most of the operations computed by the CPU use registers. A register is a small memory able to store a value. A few years ago, computers only had 32-bit registers, but today most of them have 64-bit registers.

When you program in C/C++ on a 64-bit architecture, your program always uses those registers. Most of the data types that you use are made to fit in 32 or 64 bits. For example, on a 64-bit x86 architecture:
  • an int is stored on 32 bits
  • a long is stored on 64 bits
  • a float is stored on 32 bits
  • a double is stored on 64 bits
But modern CPUs also have very special registers that can improve the performance of your programs if you know how to use them. Those registers are named:
  • xmm0, xmm1, xmm2, ... (128-bit registers on most modern CPUs)
  • ymm0, ymm1, ymm2, ... (256-bit registers on brand new CPUs)
You probably wonder what the benefit of 128-bit integers or floating point numbers could be. Actually, we don't use those registers to store very big numbers. They are used to store vectors of numbers.
For example, a 128-bit register can be used to store 4 floats or 2 doubles. Programming with those vectors is called vectorization. Instead of storing one 128-bit integer in your register, you will store 4 32-bit integers.

What kind of operation can those registers compute ?

The xmm and ymm registers support vector versions of most of the usual arithmetic instructions, such as:
  • additions
  • subtractions
  • multiplications
  • divisions
  • shuffles
  • etc.
But the great thing is that with only one instruction on a 128-bit register, you can compute 4 additions. Here is an example:

As you can see in this picture, using vectorization can decrease the number of instructions executed by your program. If you compare the instruction counts in this example, the normal program executes 4 times more instructions than the vectorized program.

You're probably wondering if this also means a 4x speedup. Actually, no! Most of those vector instructions need more cycles to complete. But don't be disappointed: you will soon see the kind of speed improvement that we can get with vectorization.

Using the Intel Compiler

In this article (and in the next ones), we will use the Intel C++ compiler. This compiler is named icpc and is very good at auto-vectorization. You can actually get a free 1-year license for non-commercial use. I really recommend downloading the free version of Parallel Studio. In addition to the compiler, you will also get VTune Amplifier which is, as far as I know, the best optimization profiler that exists. Parallel Studio also contains a tool (Inspector) that can help you debug your parallel programs.
In the following articles, I won't explain how to install Parallel Studio, but you will be able to find a lot of documentation on this specific topic on the Internet.

Also, you can get the 1-year non-commercial version of Parallel Studio from here:

The following articles will show you how to take advantage of auto-vectorization with icpc. We will also learn how to use VTune Amplifier to profile our program and examine the generated code.

Working on a simple program

In my opinion, it is important to begin with a very simple example when learning vectorization. The algorithm that I propose here is very simple. We have 3 arrays of floats: a, b and c. We want to compute:

for (unsigned int k = 0; k < KVAL; k++) {
    for (unsigned long i = 0; i < SIZE; i++) {
        c[i] += a[i] + b[i];
    }
}

Of course, we need to add a few other lines of code because we want to:

  • Generate random values for a and b
  • Compute the amount of time needed by the loops
  • Check that our result is correct

Getting the time

In HPC (High Performance Computing), measuring the execution time of a part of the code is complicated. How should we measure it? Do we use the absolute wall-clock time in seconds needed by our function to get a result? Should we count clock cycles?
To make it short, using absolute time naively is a bad idea. During its execution, the program can be paused by the system for a few milliseconds (or even more). Anyway, debating how to measure time is not our focus, so I will simply use an OpenMP function, omp_get_wtime(). This function returns a double representing a wall-clock time in seconds, which we multiply by 1000 to display milliseconds.

We can now add a few other lines to our code:

double t_before = omp_get_wtime();

//The main loop
for (unsigned int k = 0; k < KVAL; k++) {
    for (unsigned long i = 0; i < SIZE; i++) {
        c[i] += a[i] + b[i];
    }
}

double t_after = omp_get_wtime();

std::cout << "time elapsed : " << ((t_after - t_before) * 1000) << " ms" << std::endl;

Initializing the arrays

For now, our arrays are not yet initialized. We want to generate arrays large enough to get a significant execution time. For this reason, the arrays can't be allocated on the stack (the size of the stack is very limited, and even though we could increase it, that's not what we are going to do here).
The first method is to use the well known malloc function. Let's first define some values:

#define SIZE 100000000
#define KVAL 10

Then we can create and allocate our arrays as follows:

float *a, *b, *c, *d;

//Initialization for random numbers
srand(time(NULL));

//memory allocation
a = (float*) malloc(SIZE*sizeof(float));
b = (float*) malloc(SIZE*sizeof(float));
c = (float*) malloc(SIZE*sizeof(float));
d = (float*) malloc(SIZE*sizeof(float));

Now that the arrays are allocated, let's fill a and b with random values:

generate_values(a, SIZE);
generate_values(b, SIZE);
memset(c, 0, SIZE*sizeof(float));
memset(d, 0, SIZE*sizeof(float));

The memset calls fill the arrays c and d with zeros. The calls to generate_values generate random values for a and b. The definition of this function is:

void generate_values(float* tab, const unsigned long size){
   for(unsigned long i=0; i < size; i++){
      tab[i] = rand()%1000;
   }
}

We are almost done with our basic program. But as you can imagine, we want another implementation of the computed formula. This will check that the result computed by our optimized versions is correct. This way of working is not rigorous, but it will be enough for this series of articles.

bool value_ok = true;
for (unsigned long i = 0; i < SIZE; i++) {
    for (int k = 0; k < KVAL; k++)
        d[i] += a[i] + b[i];
    if (d[i] != c[i]) {
        value_ok = false;
        std::cout << "result wrong for i=" << i << " " << d[i] << "!=" << c[i] << std::endl;
        break;
    }
}

std::cout << "time elapsed : " << ((t_after - t_before)*1000) << " ms" << std::endl;
if(value_ok)
   std::cout << "Result was correct..." << std::endl;
else
   std::cout << "Result was wrong..." << std::endl;

We are done with the basic code now. The full code is available in the attached zip file.

Compile and run

The zip file contains a file named compile_main_1.sh. This file can be executed with the command line:
cedric:~/vecto #./compile_main_1.sh                                                

The compile line is pretty simple:
icpc -o run1 -O0 main1.cpp -openmp

Here is the definition of the options used:

  • -o run1 specifies that the output (the executable) will be named run1
  • -O0 means that we are not using any compiler optimization (including no vectorization at all).
  • -openmp is used to link OpenMP (we are using the function omp_get_wtime())


A file named run1 should be built. Just type the following command to run it:
cedric:~/vecto #./run1                                                                           

And here is the result on my laptop:


Our mission is to improve this reference time by playing with the compiler and the source code.