/*! \page volk_guide Instructions for using Volk in GNU Radio

\section volk_intro Introduction

Volk is the Vector-Optimized Library of Kernels. It is a library of
hand-written SIMD kernels for different mathematical operations.
Because SIMD architectures differ greatly and compilers do not yet
vectorize code reliably or efficiently, Volk takes a different
approach. For each architecture or platform that a developer wishes
to vectorize for, a new proto-kernel is added to Volk. At runtime,
Volk selects the appropriate proto-kernel. Users of Volk therefore
call a single kernel that is platform/architecture agnostic, which
allows us to write portable SIMD code.

Volk kernels are always defined with a 'generic' proto-kernel, which
is written in plain C. The generic proto-kernel makes the kernel
portable to any platform. Kernels are then extended by adding
proto-kernels for each platform where an optimized version is wanted.
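
As a rough illustration of what a generic proto-kernel amounts to, the
sketch below scales a vector with a plain loop. The names
sketch_multiply_generic and sketch_scale and their signatures are
hypothetical and chosen for this example; they are not part of the
Volk API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 'generic' proto-kernel: multiply every
// element of 'in' by the same scalar. A plain loop with no SIMD and no
// alignment requirement, so it runs on any platform.
void sketch_multiply_generic(float* out, const float* in, float scalar,
                             std::size_t num_points)
{
    // Null buffers are only acceptable when there is nothing to do.
    assert(num_points == 0 || (out != nullptr && in != nullptr));
    for (std::size_t i = 0; i < num_points; ++i) {
        out[i] = in[i] * scalar;
    }
}

// Convenience wrapper (also hypothetical) used to exercise the sketch.
std::vector<float> sketch_scale(const std::vector<float>& in, float scalar)
{
    std::vector<float> out(in.size());
    sketch_multiply_generic(out.data(), in.data(), scalar, in.size());
    return out;
}
```

A SIMD proto-kernel for the same operation would keep this signature
and replace only the loop body.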

A good example of a Volk kernel with multiple proto-kernels is
volk_32f_s32f_multiply_32f_a. This kernel multiplies a vector of
floating point numbers by a scalar (each item in the vector is
multiplied by the same value). It has proto-kernels defined for
'generic', 'avx', 'sse', and 'orc':

\code
    void volk_32f_s32f_multiply_32f_a_generic
    void volk_32f_s32f_multiply_32f_a_sse
    void volk_32f_s32f_multiply_32f_a_avx
    void volk_32f_s32f_multiply_32f_a_orc
\endcode

These proto-kernels mean that on platforms with AVX support, Volk can
select either the AVX or the SSE proto-kernel, depending on which is
faster. On other platforms, the ORC SIMD compiler might provide a
solution. If all else fails, Volk can fall back on the generic
proto-kernel, which will always work.
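
The selection idea can be sketched as a table lookup: keep only the
proto-kernels the machine supports and take the fastest, with the
generic entry guaranteeing that something is always found. This is a
minimal sketch under stated assumptions, not Volk's actual dispatcher;
the proto_kernel struct, its 'supported' and 'score' fields, and
select_proto_kernel are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Signature shared by every proto-kernel of one (hypothetical) kernel.
using multiply_fn = void (*)(float*, const float*, float, std::size_t);

// Plain-loop implementation. In this sketch every table entry points
// here; real proto-kernels would differ in their implementation.
void multiply_generic(float* out, const float* in, float scalar,
                      std::size_t num_points)
{
    for (std::size_t i = 0; i < num_points; ++i) {
        out[i] = in[i] * scalar;
    }
}

// Hypothetical description of one proto-kernel.
struct proto_kernel {
    std::string arch; // e.g. "generic", "sse", "avx"
    bool supported;   // assumed to come from a CPU feature check
    double score;     // higher is faster; assumed to come from profiling
    multiply_fn fn;
};

// Return the fastest supported proto-kernel. The table is expected to
// hold a 'generic' entry, which is supported everywhere.
const proto_kernel& select_proto_kernel(const std::vector<proto_kernel>& table)
{
    const proto_kernel* best = nullptr;
    for (const proto_kernel& pk : table) {
        if (!pk.supported) {
            continue;
        }
        if (best == nullptr || pk.score > best->score) {
            best = &pk;
        }
    }
    assert(best != nullptr); // holds whenever 'generic' is in the table
    return *best;
}
```

The caller then invokes the returned 'fn' without needing to know
which architecture was chosen.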

A note on ORC: ORC is a SIMD compiler library that uses a generic
assembly-like language for SIMD commands. Based on the SIMD
architecture available on a system, it will try to compile a good
solution. Tests show that ORC proto-kernels are generally faster than
the generic versions but often not as fast as proto-kernels
hand-tuned for a specific SIMD architecture. This is to be expected,
and ORC provides a useful intermediate step toward better performance
until a hand-tuned proto-kernel can be written for a given platform.

See <a
href="http://gnuradio.org/redmine/projects/gnuradio/wiki/Volk">Volk on
gnuradio.org</a> for details on the Volk naming scheme.


\section volk_alignment Setting and Using Memory Alignment Information

For Volk to work as well as possible, we want to use memory-aligned
SIMD calls, which means we need some way of knowing and controlling
the alignment of the buffers passed to gr_block's work function. We
set the alignment requirement for SIMD aligned memory calls with:

\code
  const int alignment_multiple =
    volk_get_alignment() / output_item_size;
  set_alignment(alignment_multiple);
\endcode

The Volk function 'volk_get_alignment' provides the alignment, in
bytes, of the machine architecture. We express the alignment as the
number of output items required to maintain it, so we divide the
number of alignment bytes by the number of bytes in an output item
(sizeof(float), sizeof(gr_complex), etc.). This value is then set per
block with the 'set_alignment' function.
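
The arithmetic can be checked with a small standalone helper. In this
sketch the function name alignment_multiple is hypothetical, the
'alignment_bytes' argument stands in for the value returned by
volk_get_alignment, and clamping the result to at least 1 is an
assumption made here for items larger than the alignment.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Number of output items that span one alignment boundary.
// 'alignment_bytes' stands in for the result of volk_get_alignment().
int alignment_multiple(std::size_t alignment_bytes,
                       std::size_t output_item_size)
{
    assert(output_item_size > 0);
    const std::size_t multiple = alignment_bytes / output_item_size;
    // An item larger than the alignment divides to 0; this sketch
    // clamps to 1 so the result is always a usable item count.
    return multiple > 0 ? static_cast<int>(multiple) : 1;
}
```

With a 16-byte alignment, float items give a multiple of 4 and
complex float items (8 bytes each) give a multiple of 2.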

Because the scheduler tries to optimize throughput, the number of
items available per call to work changes and depends on the
availability of the read and write buffers. This means that the
scheduler sometimes cannot produce a buffer that is properly memory
aligned. This is an inevitable consequence of the scheduler
system. Rather than guaranteeing alignment, the scheduler maintains
it whenever it can, and when a buffer becomes unaligned, the
scheduler works to realign it. If a block's buffers are unaligned,
the scheduler sets a flag to indicate this so that the block can
decide how best to proceed. The next section discusses the use of the
aligned/unaligned information in a gr_block's work function.
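
What "aligned" means for a buffer can be made concrete with a pointer
check. The helpers below are a minimal sketch: is_aligned and
any_unaligned are hypothetical names, and this is not how the
scheduler itself computes its flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when 'ptr' sits on an 'alignment'-byte boundary.
bool is_aligned(const void* ptr, std::size_t alignment)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// True when either buffer of a one-input, one-output block is off the
// boundary, in which case the aligned code path must not be used.
bool any_unaligned(const void* in, const void* out, std::size_t alignment)
{
    return !is_aligned(in, alignment) || !is_aligned(out, alignment);
}
```

For example, a float pointer advanced by one item from a 16-byte
boundary is 4 bytes past it, so it is reported as unaligned.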


\section volk_work Using Alignment Properties in Work()

The buffers passed to work/general_work in a gr_block are not
guaranteed to be aligned, but they will be aligned whenever
possible. When they are not aligned, the 'is_unaligned()' flag is
set, so a block can tell whether its buffers are aligned and choose
the appropriate call. This looks like:

\code
int
gr_some_block::work (int noutput_items,
		     gr_vector_const_void_star &input_items,
		     gr_vector_void_star &output_items)
{
  const float *in = (const float *) input_items[0];
  float *out = (float *) output_items[0];

  if(is_unaligned()) {
    // do something with unaligned data. This can either be a manual
    // handling of the items or a call to an unaligned Volk function.
    volk_32f_something_32f_u(out, in, noutput_items);
  }
  else {
    // Buffers are aligned; can call the aligned Volk function.
    volk_32f_something_32f_a(out, in, noutput_items);
  }

  return noutput_items;
}
\endcode



\section volk_tuning Tuning Volk Performance

VOLK comes with a profiler that builds a config file recording the
best SIMD architecture for your processor. Run volk_profile, which is
installed in $PREFIX/bin. This program tests all known VOLK kernels
for each architecture supported by the processor. When finished, it
writes the best architecture for each VOLK function to
$HOME/.volk/volk_config. VOLK reads this file when a function is used
to determine which version of the function to execute.

\subsection volk_hand_tuning Hand-Tuning Performance

If you know that a particular architecture works best for your
processor, you can specify it in the VOLK preferences file,
$HOME/.volk/volk_config.

Each line of the file has the form:

\code
    volk_<FUNCTION_NAME> <ARCHITECTURE>
\endcode

Here "FUNCTION_NAME" is the function whose default you want to
override and "ARCHITECTURE" is the VOLK SIMD architecture to use
(generic, sse, sse2, sse3, avx, etc.). For example, the following
config file tells VOLK to use SSE3 for the aligned and unaligned
versions of a function that multiplies two complex streams together.

\code
    volk_32fc_x2_multiply_32fc_a sse3
    volk_32fc_x2_multiply_32fc_u sse3
\endcode
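
To show how such a file could be consumed, the sketch below reads
name/architecture pairs into a map. It is not VOLK's own parser:
parse_volk_config and parse_volk_config_text are hypothetical names,
and skipping lines that start with a hash character is an assumption
of this example.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Parse lines of the form "volk_<FUNCTION_NAME> <ARCHITECTURE>" into
// a function-name -> architecture map.
std::map<std::string, std::string> parse_volk_config(std::istream& in)
{
    std::map<std::string, std::string> prefs;
    std::string line;
    while (std::getline(in, line)) {
        // Assumption of this sketch: '#' starts a comment line.
        if (!line.empty() && line[0] == '#') {
            continue;
        }
        std::istringstream fields(line);
        std::string name;
        std::string arch;
        // Blank or incomplete lines fail this extraction and are skipped.
        if (fields >> name >> arch) {
            prefs[name] = arch; // a later line overrides an earlier one
        }
    }
    return prefs;
}

// Convenience overload for text already held in memory.
std::map<std::string, std::string> parse_volk_config_text(const std::string& text)
{
    std::istringstream in(text);
    return parse_volk_config(in);
}
```

Applied to the two-line example above, the map holds two entries,
both with the value "sse3".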

\b Tip: if benchmarking GNU Radio blocks, it can be useful to have a
volk_config file that sets all architectures to 'generic' as a way to
test the vectorized versus non-vectorized implementations.
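
One way to produce such a file is to emit a 'generic' line for every
function name of interest. This is a minimal sketch:
make_generic_config is a hypothetical helper, and the list of
function names must be supplied by the caller.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build config text that maps every listed function to 'generic',
// one "<name> generic" line per function.
std::string make_generic_config(const std::vector<std::string>& kernel_names)
{
    std::string text;
    for (const std::string& name : kernel_names) {
        text += name + " generic\n";
    }
    return text;
}
```

The resulting text can be saved as $HOME/.volk/volk_config for the
non-vectorized benchmark run.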

*/