Vc  0.7.5-dev
SIMD Vector Classes for C++
Introduction

If you are new to vectorization, please read the following part and make sure you understand it:

  • Forget what you learned about vectors in math classes. SIMD vectors are a different concept!
  • Forget about containers that also go by the name of a vector. SIMD vectors are a different concept!
  • A vector is defined by the hardware as a special register which is wider than required for a single value. Thus multiple values fit into one register. The width of this register and the size of the scalar data type in use determine the number of entries in the vector. Therefore this number is an unchangeable property of the hardware and not a variable in the Vc API.
  • Note that hardware is free to use different vector register widths for different data types. For example, AVX provides 256-bit floating-point instructions but only 128-bit integer instructions.
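The relation between register width, scalar size, and number of entries can be sketched in plain C++. The helper name entries_per_register is hypothetical and not part of the Vc API, and the checks assume the usual 4-byte float and 8-byte double:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper (not part of Vc): number of entries of scalar type T
// that fit into a vector register of the given width in bits.
template <typename T>
constexpr std::size_t entries_per_register(std::size_t register_bits)
{
    return register_bits / (CHAR_BIT * sizeof(T));
}

// Assumes sizeof(float) == 4 and sizeof(double) == 8.
static_assert(entries_per_register<float>(128) == 4, "SSE: four floats");
static_assert(entries_per_register<double>(128) == 2, "SSE: two doubles");
static_assert(entries_per_register<float>(256) == 8, "AVX: eight floats");
```

Because both inputs are fixed by the target and the scalar type, the result is a compile-time constant, which is why it cannot be a runtime variable.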
Example 1:

You can modify a function to use vector types and thus implement a horizontal vectorization. The original scalar function could look like this:

#include <cmath>

void normalize(float &x, float &y, float &z)
{
    const float d = std::sqrt(x * x + y * y + z * z); // length of the 3D vector
    x /= d;
    y /= d;
    z /= d;
}

To vectorize the normalize function with Vc, the types must be substituted by their Vc counterparts and math functions must use the Vc implementation (which is, by default, also imported into the std namespace):

#include <Vc/Vc>
using Vc::float_v;

void normalize(float_v &x, float_v &y, float_v &z)
{
    const float_v d = Vc::sqrt(x * x + y * y + z * z); // one length per vector entry
    x /= d;
    y /= d;
    z /= d;
}

When compiled for SSE, the latter function normalizes four 3D vectors in the same time the former function needs to normalize one.

For completeness, note that you can optimize the division in the normalize function further by computing the inverse of d once. Use either of the following two alternatives:

// Alternative 1: a single division
const float_v d_inv = float_v::One() / Vc::sqrt(x * x + y * y + z * z);
// Alternative 2: reciprocal square root (less accurate, but faster)
const float_v d_inv = Vc::rsqrt(x * x + y * y + z * z);

Then you can multiply x, y, and z by d_inv, which is considerably faster than three divisions.
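To show the semantics without requiring Vc, the following sketch emulates a four-entry float vector in scalar C++. FloatV4, kLanes, and inv_sqrt are hypothetical stand-ins, not Vc types or functions. A real SIMD type executes each operator as one instruction, whereas this emulation loops over the entries:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kLanes = 4; // entries of a float vector on SSE

// Hypothetical scalar emulation of a four-entry SIMD float vector.
struct FloatV4 {
    std::array<float, kLanes> lane;
};

inline FloatV4 operator*(const FloatV4 &a, const FloatV4 &b)
{
    FloatV4 r;
    for (std::size_t i = 0; i < kLanes; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

inline FloatV4 operator+(const FloatV4 &a, const FloatV4 &b)
{
    FloatV4 r;
    for (std::size_t i = 0; i < kLanes; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

inline FloatV4 &operator*=(FloatV4 &a, const FloatV4 &b)
{
    a = a * b;
    return a;
}

// 1 / sqrt(a) for every entry.
inline FloatV4 inv_sqrt(const FloatV4 &a)
{
    FloatV4 r;
    for (std::size_t i = 0; i < kLanes; ++i) r.lane[i] = 1.0f / std::sqrt(a.lane[i]);
    return r;
}

// Normalizes four 3D vectors at once: entry i of x, y, z forms one 3D vector.
inline void normalize(FloatV4 &x, FloatV4 &y, FloatV4 &z)
{
    const FloatV4 d_inv = inv_sqrt(x * x + y * y + z * z);
    x *= d_inv; // three multiplications instead of three divisions
    y *= d_inv;
    z *= d_inv;
}
```

Note that x holds the x coordinates of four different 3D vectors; the three components of one 3D vector are spread over the same entry index of x, y, and z.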

As you can probably see, the new challenge with Vc is the use of good data structures which support horizontal vectorization. Depending on the problem at hand, this may become the main focus of design (it does not have to be, though).
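One common way to obtain such a data structure is to store each coordinate in its own array, so that equal coordinates of consecutive points are adjacent in memory. The following is only a sketch in standard C++; PointAoS, PointsSoA, and to_soa are hypothetical names, and std::vector with the default allocator does not guarantee vector alignment:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: x, y, z of one point are adjacent. A four-entry
// vector load would fetch x0 y0 z0 x1, mixing different coordinates.
struct PointAoS {
    float x, y, z;
};

// Structure of arrays: a four-entry vector load from x fetches x0 x1 x2 x3,
// which is the input a vectorized normalize function expects.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Converts from the array-of-structures to the structure-of-arrays layout.
inline PointsSoA to_soa(const std::vector<PointAoS> &points)
{
    PointsSoA out;
    out.x.reserve(points.size());
    out.y.reserve(points.size());
    out.z.reserve(points.size());
    for (const PointAoS &p : points) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}
```

Converting on every call would cost more than it gains, so in practice the layout is usually chosen once for the data that the hot loops touch.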

Alignment

What is Alignment

If you do not know what alignment is and why it is important, read on; otherwise skip to Tools. Normally the alignment of data is an implementation detail left to the compiler. Until C++11, the language did not even have any (official) means to query or modify alignment.

Most data types require more than one byte for storage. Thus, even most fundamental data types span several locations in memory. E.g. if you have a pointer to float, the address stored in this pointer just determines the first of the four bytes of the float. Naively, one could think that any address (which belongs to the process) can be used to store such a float. While this is true for some architectures, others may terminate the process when a misaligned pointer is dereferenced. The natural alignment of a fundamental data type typically equals its size. Thus the address of a float object should always be a multiple of 4.
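Whether an address satisfies an alignment requirement can be checked with a small helper. This is a minimal sketch; is_aligned is a hypothetical name, and alignof is the C++11 way to query the requirement of a type:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the address p is a multiple of `alignment` bytes.
inline bool is_aligned(const void *p, std::size_t alignment)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For a float f on the stack, is_aligned(&f, alignof(float)) holds because the compiler places the object correctly. An address one byte past an 8-byte aligned buffer, in contrast, is not a multiple of 4 and therefore not a valid address for a float.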

Alignment becomes more important for SIMD data types:

  1. There are different instructions to load/store aligned and unaligned vectors. Unaligned loads/stores have been greatly improved in recent x86 CPUs. Still, as a rule of thumb, aligned loads/stores are faster.
  2. Accessing an unaligned vector with an instruction that expects an aligned vector crashes the application. Once you write vectorized code, you might want to make it a habit to check crashes for unaligned addresses.
  3. Memory allocation on the heap returns addresses aligned according to some system-specific rule. E.g. Linux 32bit aligns on 8 bytes, while Linux 64bit aligns on 16 bytes. Neither alignment is strict enough for AVX vectors. Worse, if you develop on Linux 64bit with SSE you won't notice any problems until you switch to a 32bit build or to AVX.
  4. Placement on the stack is determined at compile time and requires the compiler to know the alignment restrictions of the type.
  5. The size of a cache line is just two or four times larger than the SIMD types (if not equal). Thus, if you load several unaligned vectors consecutively from memory, every fourth, every second, or even every single load has to read from two different cache lines. This is called a cache line split. Cache line splits degrade performance, which becomes very noticeable for memory-intensive code.
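The arithmetic behind cache line splits can be sketched as follows. The function names are hypothetical, and the default cache line size of 64 bytes is an assumption that matches common x86 CPUs but is not guaranteed:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if a load of vector_bytes bytes starting at `address` touches two
// cache lines, i.e. its first and last byte lie in different lines.
inline bool splits_cache_line(std::uintptr_t address, std::size_t vector_bytes,
                              std::size_t line_bytes)
{
    assert(vector_bytes > 0 && line_bytes > 0);
    const std::uintptr_t first_line = address / line_bytes;
    const std::uintptr_t last_line = (address + vector_bytes - 1) / line_bytes;
    return first_line != last_line;
}

// Counts the splits among `loads` consecutive vector loads starting at `start`.
// line_bytes = 64 is an assumed cache line size.
inline std::size_t count_splits(std::uintptr_t start, std::size_t vector_bytes,
                                std::size_t loads, std::size_t line_bytes = 64)
{
    std::size_t splits = 0;
    for (std::size_t i = 0; i < loads; ++i) {
        if (splits_cache_line(start + i * vector_bytes, vector_bytes, line_bytes)) {
            ++splits;
        }
    }
    return splits;
}
```

With 16-byte vectors starting 8 bytes past a line boundary, one of four consecutive loads splits; with 32-byte vectors starting 16 bytes past a boundary, every second load splits; with misaligned 64-byte vectors, every load splits. Aligned vectors never split.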

Tools

Vc provides several classes and functions to get alignment right.

  • Vc::VectorAlignment is a compile-time constant that equals the largest alignment restriction (in bytes) for the selected target architecture.
  • Vc::VectorAlignedBase and Vc::VectorAlignedBaseT are helper classes that use compiler specific extensions to annotate the alignment restrictions for vector types. Additionally they reimplement new and delete to return correctly aligned pointers to the heap.
  • Vc::malloc and Vc::free are meant as replacements for malloc and free. They can be used to allocate any type of memory with an abstract alignment restriction: Vc::MallocAlignment. Note that (like malloc) the memory is only allocated and not initialized. If you allocate memory for a type that has a constructor, use the placement new syntax to initialize the memory.
  • Vc::Allocator is an STL-compatible allocator class that behaves as specified in the C++ specification, implementing the optional support for over-aligned types. Therefore, memory addresses returned from this allocator will always be aligned to at least the constraints attached to the type T. STL containers will already default to Vc::Allocator for Vc::Vector<T>. For all other composite types you want to use, you can use the VC_DECLARE_ALLOCATOR convenience macro to set it as the default.
  • Vc::Memory, Vc::Memory<V, Size, 0u>, and Vc::Memory<V, 0u, 0u>: the three different variants of the memory class can be used like a more convenient C-array. They support two-dimensional statically sized arrays and one-dimensional statically and dynamically sized arrays. The memory can be accessed easily via aligned vectors, but also via unaligned vectors or gathers/scatters.
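The allocate-then-initialize pattern described for Vc::malloc can be illustrated without Vc. The sketch below is not the Vc implementation: aligned_malloc, aligned_free, Particle, create_particle, and destroy_particle are hypothetical names, and the over-allocation technique is just one portable way to obtain aligned heap memory:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical helper: returns uninitialized memory aligned to `alignment`
// bytes (a power of two), or nullptr on failure.
inline void *aligned_malloc(std::size_t bytes, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (alignment < alignof(void *)) alignment = alignof(void *);
    // Room for the payload, worst-case padding, and the original pointer.
    void *raw = std::malloc(bytes + alignment - 1 + sizeof(void *));
    if (raw == nullptr) return nullptr;
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask; // round up
    // Store the pointer returned by malloc directly in front of the block.
    reinterpret_cast<void **>(aligned)[-1] = raw;
    return reinterpret_cast<void *>(aligned);
}

inline void aligned_free(void *p)
{
    if (p != nullptr) std::free(static_cast<void **>(p)[-1]);
}

// A type with a constructor: raw memory must be initialized explicitly.
struct Particle {
    float x, y, z;
    Particle() : x(0.f), y(0.f), z(0.f) {}
};

inline Particle *create_particle(std::size_t alignment)
{
    void *mem = aligned_malloc(sizeof(Particle), alignment);
    return mem != nullptr ? new (mem) Particle() : nullptr; // placement new
}

inline void destroy_particle(Particle *p)
{
    if (p != nullptr) {
        p->~Particle(); // run the destructor before releasing the memory
        aligned_free(p);
    }
}
```

Calling create_particle(32) yields an object whose address is a multiple of 32, which would satisfy AVX vectors; every object created this way must be released with destroy_particle rather than delete or free.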