/*! \page volk_guide Instructions for using Volk in GNU Radio

\section volk_intro Introduction

Volk is the Vector-Optimized Library of Kernels. It is a library that
contains kernels of hand-written SIMD code for different mathematical
operations. Since SIMD architectures can differ greatly from one
another, and no compiler yet handles vectorization properly or with
high efficiency, Volk approaches the problem differently. For each
architecture or platform that a developer wishes to vectorize for, a
new proto-kernel is added to Volk. At runtime, Volk selects the correct
proto-kernel. In this way, users of Volk call a kernel that performs
the operation in a platform/architecture-agnostic way. This allows us
to write portable SIMD code.

Volk kernels are always defined with a 'generic' proto-kernel, which is
written in plain C. The generic proto-kernel makes the kernel portable
to any platform. Kernels are then extended by adding proto-kernels for
each new platform on which a faster version is desired.

A good example of a Volk kernel with multiple proto-kernels defined is
volk_32f_s32f_multiply_32f_a. This kernel implements a scalar
multiplication of a vector of floating point numbers (each item in the
vector is multiplied by the same value). It has proto-kernels defined
for 'generic', 'avx', 'sse', and 'orc':

\code
  void volk_32f_s32f_multiply_32f_a_generic
  void volk_32f_s32f_multiply_32f_a_sse
  void volk_32f_s32f_multiply_32f_a_avx
  void volk_32f_s32f_multiply_32f_a_orc
\endcode

These proto-kernels mean that on platforms with AVX support, Volk can
select either the AVX or the SSE proto-kernel, depending on which is
faster. On other platforms, the ORC SIMD compiler might provide a
solution. If all else fails, Volk can fall back on the generic
proto-kernel, which will always work.

A note on ORC: ORC is a SIMD compiler library that uses a generic
assembly-like language for SIMD commands.
Based on the available SIMD architecture of a system, it tries to
compile a good solution. Tests show that the results of ORC
proto-kernels are generally better than the generic versions but often
not as good as the hand-tuned proto-kernels for a specific SIMD
architecture. This is to be expected, and ORC provides a useful
intermediate step toward better performance until a hand-tuned
proto-kernel can be written for a given platform.

See Volk on gnuradio.org for details on the Volk naming scheme.

\section volk_alignment Setting and Using Memory Alignment Information

For Volk to perform as well as possible, we want to use memory-aligned
SIMD calls, which means we need a way to know and control the alignment
of the buffers passed to a gr_block's work function. We set the
alignment requirement for SIMD-aligned memory calls with:

\code
  // Number of output items that span one alignment boundary.
  const int alignment_multiple =
    volk_get_alignment() / output_item_size;
  // Never request an alignment of less than one item.
  set_alignment(std::max(1,alignment_multiple));
\endcode

The Volk function 'volk_get_alignment' provides the alignment, in
bytes, of the machine architecture. We then express the alignment as
the number of output items required to maintain it, so we divide the
number of alignment bytes by the number of bytes in an output item
(sizeof(float), sizeof(gr_complex), etc.). This value is then set per
block with the 'set_alignment' function.

Because the scheduler tries to optimize throughput, the number of items
available per call to work changes and depends on the availability of
the read and write buffers. This means that the scheduler sometimes
cannot produce a buffer that is properly memory aligned. This is an
inevitable consequence of the scheduler system. Rather than
guaranteeing alignment, the scheduler maintains it as much as possible,
and when a buffer becomes unaligned, the scheduler works to restore
alignment as soon as it can.
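
Returning to the alignment calculation earlier in this section, the
arithmetic can be checked in isolation. The following is a minimal
stand-alone sketch, not part of Volk or GNU Radio: the helper name
alignment_multiple is hypothetical, and its alignment_bytes parameter
stands in for the value that volk_get_alignment() would return on a
given machine.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper mirroring the calculation shown in this section:
// how many output items span one alignment boundary.
// alignment_bytes stands in for the result of volk_get_alignment().
int alignment_multiple(std::size_t alignment_bytes,
                       std::size_t output_item_size)
{
    const int multiple =
        static_cast<int>(alignment_bytes / output_item_size);
    // An item larger than the alignment would give 0, so clamp to 1.
    return std::max(1, multiple);
}
```

For instance, assuming a 16-byte alignment (as used by SSE), 4-byte
items give a multiple of 4 and 8-byte items give a multiple of 2. An
item larger than the alignment gives 1 because of the clamp.
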
If a block's buffers are unaligned, the scheduler sets a flag to
indicate this so that the block can decide how best to proceed. The
next section discusses the use of the aligned/unaligned information in
a gr_block's work function.

\section volk_work Using Alignment Properties in Work()

The buffers passed to work/general_work in a gr_block are not
guaranteed to be aligned, but they will be aligned whenever possible.
When they are not aligned, the 'is_unaligned()' flag is set, so a block
can tell whether its buffers are aligned and choose the appropriate
call. This looks like:

\code
int
gr_some_block::work (int noutput_items,
                     gr_vector_const_void_star &input_items,
                     gr_vector_void_star &output_items)
{
  const float *in = (const float *) input_items[0];
  float *out = (float *) output_items[0];

  if(is_unaligned()) {
    // do something with unaligned data. This can either be a manual
    // handling of the items or a call to an unaligned Volk function.
    volk_32f_something_32f_u(out, in, noutput_items);
  }
  else {
    // Buffers are aligned; can call the aligned Volk function.
    volk_32f_something_32f_a(out, in, noutput_items);
  }

  return noutput_items;
}
\endcode

\section volk_tuning Tuning Volk Performance

VOLK comes with a profiler that builds a config file recording the best
SIMD architecture for your processor. Run volk_profile, which is
installed into $PREFIX/bin. This program tests all known VOLK kernels
for each architecture supported by the processor. When finished, it
writes the best architecture for each VOLK function to
$HOME/.volk/volk_config. VOLK reads this file when a function is used
to determine which version of that function to execute.
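
How a per-function preference could drive the choice of architecture
can be sketched as follows. This is only an illustration, not VOLK's
actual loader: it assumes a config file that holds one
"function architecture" pair per line, and the helpers parse_prefs and
choose_arch are hypothetical names introduced for this example.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse "function architecture" lines into a map.
// Blank lines and lines with fewer than two fields are skipped.
std::map<std::string, std::string> parse_prefs(const std::string& text)
{
    std::map<std::string, std::string> prefs;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string name, arch;
        if (fields >> name >> arch) {
            prefs[name] = arch;
        }
    }
    return prefs;
}

// Hypothetical helper: return the preferred architecture for a function
// if it is listed and supported on this machine; otherwise "generic".
std::string choose_arch(const std::map<std::string, std::string>& prefs,
                        const std::string& function,
                        const std::vector<std::string>& supported)
{
    const auto it = prefs.find(function);
    if (it != prefs.end() &&
        std::find(supported.begin(), supported.end(), it->second) !=
            supported.end()) {
        return it->second;
    }
    return "generic";
}
```

In this sketch, a function with no entry, or with an entry naming an
architecture the machine does not support, falls back to 'generic',
which matches the role of the generic proto-kernel as the version that
always works.
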
\subsection volk_hand_tuning Hand-Tuning Performance

If you know that a particular architecture works best for your
processor, you can specify the architecture to use in the VOLK
preferences file: $HOME/.volk/volk_config

Each line of the file looks like:

\code
  volk_<FUNCTION_NAME> <ARCHITECTURE>
\endcode

where "FUNCTION_NAME" is the function whose default you want to
override and "ARCHITECTURE" is the VOLK SIMD architecture to use
(generic, sse, sse2, sse3, avx, etc.). For example, the following
config file tells VOLK to use SSE3 for the aligned and unaligned
versions of a function that multiplies two complex streams together.

\code
  volk_32fc_x2_multiply_32fc_a sse3
  volk_32fc_x2_multiply_32fc_u sse3
\endcode

\b Tip: if benchmarking GNU Radio blocks, it can be useful to have a
volk_config file that sets all architectures to 'generic' as a way to
compare the vectorized and non-vectorized implementations.

*/