
/*! \page page_metadata Metadata Information

\section Introduction

Metadata files have extra information in the form of headers that
carry metadata about the samples in the file. Raw, binary files carry
no extra information and must be handled delicately. Any changes in
the system state, such as the sample rate or a receiver's frequency,
are not conveyed with the data in the file itself. Metadata headers
solve this problem.

We write metadata files using gr::blocks::file_meta_sink and read metadata
files using gr::blocks::file_meta_source.

Metadata files have headers that carry information about a segment of
data within the file. The header structure is described in detail in
the next section. A metadata file always starts with a header that
describes the basic structure of the data. It contains information
about the item size, data type, if it's complex, the sample rate of
the segment, the time stamp of the first sample of the segment, and
information regarding the header size and segment size.

Headers have two main tags associated with them:

- rx_rate: the sample rate of the stream.
- rx_time: the time stamp of the first item in the segment.

These tags were inspired by the UHD tag format.

The header gives enough information to process and handle the
data. One cautionary note, though, is that the data type should never
change within a file. There should be very little need for this, but
more importantly, GNU Radio blocks can only set the data type of their
IO signatures in the constructor, so changes in the data type
afterward will not be recognized.

We also have an extra header segment that is optional. This can be
loaded up at the beginning by the user specifying some extra metadata
that should be transmitted along with the data. It also grows whenever
it sees a stream tag, so the dictionary will contain any key:value
pairs taken from tags in the flowgraph.


\subsection types Types of Metadata Files

GNU Radio currently supports two types of metadata files:

- inline: headers are inline with the data in the same file.
- detached: headers are in a separate header file from the data.

The inline method is the standard version. When a detached header is
used, the headers are simply inserted back-to-back in the detached
header file. The data file, then, is the standard raw binary format
with no interruptions in the data.


\subsection updating Updating Headers

While there is always a header that starts a metadata file, headers
are updated throughout the file as well. There are two events that
trigger a new header. We define a segment as the unit of data
associated with the last header.

The first event that will trigger a new header is when enough samples
have been written for the given segment. This number is defined as the
maximum segment size and is a parameter we pass to the
file_meta_sink. It defaults to 1 million items (items, not
bytes). When that number of items is reached, a new header is
generated and a new segment is started. This makes it easier for us to
manipulate the data later and helps protect against catastrophic data
loss.
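
As a rough illustration of the size-triggered rule, the following
Python sketch splits a run of items into segments no larger than the
maximum segment size. The function name 'segment_lengths' is
hypothetical and not part of GNU Radio; it only models the counting
and ignores headers triggered by tags.

```python
def segment_lengths(total_items, max_segment_size=1000000):
    """Return the item count of each segment for a run of total_items.

    Hypothetical helper: models only the size-triggered rule (a new
    header every max_segment_size items), not tag-triggered headers.
    """
    if max_segment_size <= 0:
        raise ValueError("max_segment_size must be positive")
    lengths = []
    remaining = total_items
    while remaining > 0:
        n = min(remaining, max_segment_size)
        lengths.append(n)
        remaining -= n
    return lengths
```

With the default maximum, 2.5 million items would give two full
segments followed by one segment of 500000 items.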

The second event to trigger a new segment is if a new tag is
observed. If the tag is a standard tag in the header, the header value
is updated, the header and current extras are written to file, and the
segment begins again. If a tag from the extras is seen, the value
associated with that tag is updated; and if a new tag is seen, a new
key:value pair is added to the extras dictionary.

When new tags are seen, we generate a new segment so that we make sure
that all samples in that segment are defined by the header. If the
sample rate changes, we create a new segment where all of the new
samples are at that new rate. Also, in the case of UHD devices, if a
segment loss is observed, it will generate a new timestamp as a tag of
'rx_time'. We create a new file segment that reflects this change to
keep the sample times exact.
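
Because every item in a segment shares the same 'rx_rate' and the
segment's 'rx_time' marks its first item, the time of any later item
can be derived from those two values. The sketch below assumes that
the 'rx_time' pair holds whole seconds and fractional seconds, as in
the UHD tag format; the function name 'item_time' is hypothetical.

```python
def item_time(rx_time, rx_rate, n):
    """Time stamp of the n-th item (0-based) in a segment.

    rx_time is (whole_seconds, fractional_seconds) for the first item
    and rx_rate is the sample rate in items per second. The result is
    returned in the same pair form so whole seconds stay an integer.
    """
    if rx_rate <= 0:
        raise ValueError("rx_rate must be positive")
    secs, frac = rx_time
    total_frac = frac + n / float(rx_rate)
    carry = int(total_frac)  # whole seconds spilled out of the fraction
    return secs + carry, total_frac - carry
```

For example, with a segment starting at (10, 0.5) and a rate of 1000
items per second, item 250 falls at (10, 0.75).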


\subsection implementation Implementation

Metadata files are created using gr::blocks::file_meta_sink. The
default behavior is to create a single file with inline headers as
metadata. An option can be set to switch to detached header mode.

Metadata files are read into a flowgraph using
gr::blocks::file_meta_source. This source reads a metadata file,
inline by default, with a settable option to use detached headers. The
data from the segments is converted into a standard streaming
output. The 'rx_rate' and 'rx_time' and all key:value pairs in the
extra header are converted into tags and added to the stream tags
interface.


\section structure Structure

The file metadata consists of a static mandatory header and a dynamic
optional extras header. Each header is a separate PMT
dictionary. Headers are created by building a PMT dictionary
(pmt::pmt_make_dict) of key:value pairs, then the dictionary is
serialized into a string to be written to file. The header always has
the same length, which is predetermined by the version of the header
(this must be known already). The header will then indicate if there
is any extra data to be extracted as a separate serialized dictionary.

To work with the PMTs for creating and extracting header information,
we use PMT operators. For example, we create a simplified version of
the header in C++ like this:

\code
  using namespace pmt;
  const char METADATA_VERSION = 0x0;
  pmt_t header;
  header = pmt_make_dict();
  header = pmt_dict_add(header, mp("version"), mp(METADATA_VERSION));
  header = pmt_dict_add(header, mp("rx_rate"), mp(samp_rate));
  std::string hdr_str = pmt_serialize_str(header);
\endcode

The call to pmt::pmt_dict_add adds a new key:value pair to the
dictionary. Notice that it both takes and returns the 'header'
variable. This is because we are actually creating a new dictionary
with this function, so we just assign it to the same variable.

The 'mp' functions are convenience functions provided by the PMT
library. They interpret the data type of the value being inserted and
call the correct 'pmt_from_xxx' function. For more direct control over
the data type, see PMT functions in pmt.h, such as
pmt::pmt_from_uint64 or pmt::pmt_from_double.

We finish this off by using pmt::pmt_serialize_str to convert the PMT
dictionary into a specialized string format that makes it easy to
write to a file.

The header is always METADATA_HEADER_SIZE bytes long and a metadata
file always starts with a header. So to extract the header from a
file, we need to read in this many bytes from the beginning of the
file and deserialize it. An important note about this is that the
deserialize function must operate on a std::string. The serialized
format of a dictionary contains null characters, so normal C character
arrays (e.g., 'char *s') get confused.

Assuming that 'std::string str' contains the full string as read from
a file, we can access the dictionary in C++ like this:

\code
  pmt_t hdr = pmt_deserialize_str(str);
  if(pmt_dict_has_key(hdr, pmt_string_to_symbol("strt"))) {
    pmt_t r = pmt_dict_ref(hdr, pmt_string_to_symbol("strt"), PMT_NIL);
    uint64_t seg_start = pmt_to_uint64(r);
    uint64_t extra_len = seg_start - METADATA_HEADER_SIZE;
  }
\endcode

This example first deserializes the string into a PMT dictionary
again. This will throw an error if the string is malformed and cannot
be deserialized correctly. We then want to get access to the item with
key 'strt'. As the next subsection will show, this value indicates at
which byte the data segment starts. We first check to make sure that
this key exists in the dictionary. If not, our header does not contain
the correct information and we might want to handle this as an error.

Assuming the header is properly formatted, we then get the particular
item referenced by the key 'strt'. This is a uint64_t, so we use the
PMT function to extract and convert this value properly. We now know
if we have an extra header in the file by looking at the difference
between 'seg_start' and the static header size,
METADATA_HEADER_SIZE. If the 'extra_len' is greater than 0, we know we
have an extra header that we can process. Moreover, this also tells us
the size of the serialized PMT dictionary in bytes, so we can easily
read this many bytes from the file. We can then deserialize and parse
this header just like the first.
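
The same arithmetic can be sketched in Python to show where each part
of an inline segment sits in the file. This is only an illustration:
the function name 'segment_layout' is hypothetical, the header size
used here is a placeholder to be replaced by the real
METADATA_HEADER_SIZE, and we assume that in an inline file the next
header begins right after the segment's data.

```python
# Placeholder for illustration only; take the real value from
# METADATA_HEADER_SIZE (C++) or HEADER_LENGTH (Python) in your install.
ASSUMED_HEADER_SIZE = 149


def segment_layout(offset, strt, nbytes, header_size=ASSUMED_HEADER_SIZE):
    """Byte ranges of one inline segment whose header begins at offset.

    strt and nbytes are the 'strt' and 'bytes' header fields. Each
    range is a (start, end) pair with end exclusive; 'next' is where
    the following header would begin in an inline file.
    """
    extra_len = strt - header_size
    if extra_len < 0:
        raise ValueError("'strt' is smaller than the static header size")
    header_end = offset + header_size
    data_start = offset + strt
    data_end = data_start + nbytes
    return {
        "header": (offset, header_end),
        "extras": (header_end, data_start),
        "data": (data_start, data_end),
        "next": data_end,
    }
```

An empty 'extras' range (start equal to end) corresponds to the case
where 'strt' equals the static header size.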


\subsection header Header Information

The header is a PMT dictionary with a known structure. This structure
may change, but we version the headers, so all headers of version X
must be the same length and structure. As of now, we only have version
0 headers, which look like the following:

- version: (char) version number (usually set to METADATA_VERSION)
- rx_rate: (double) Stream's sample rate
- rx_time: (pmt::pmt_t pair - (uint64_t, double)) Time stamp (format from UHD)
- size: (int) item size in bytes - reflects vector length if any.
- type: (int) data type (enum below)
- cplx: (bool) true if data is complex
- strt: (uint64_t) start of data relative to current header
- bytes: (uint64_t) size of following data segment in bytes

The data types are indicated by an integer value from the following
enumeration type:

\code
enum gr_file_types {
  GR_FILE_BYTE=0,
  GR_FILE_CHAR=0,
  GR_FILE_SHORT=1,
  GR_FILE_INT,
  GR_FILE_LONG,
  GR_FILE_LONG_LONG,
  GR_FILE_FLOAT,
  GR_FILE_DOUBLE,
};
\endcode
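
To show how the 'type', 'cplx', 'size', and 'bytes' fields fit
together, the following Python sketch works on a header that has
already been converted to a plain dictionary. The names
'FTYPE_NAMES', 'ASSUMED_FTYPE_SIZES', and 'describe_segment' are
hypothetical, and the element sizes are assumptions that should be
checked against the target platform.

```python
# Integer codes from the gr_file_types enumeration above.
# GR_FILE_CHAR shares the value 0 with GR_FILE_BYTE.
FTYPE_NAMES = {
    0: "byte",
    1: "short",
    2: "int",
    3: "long",
    4: "long long",
    5: "float",
    6: "double",
}

# Assumed element sizes in bytes; 'long' in particular is platform
# dependent, so check these against your own build.
ASSUMED_FTYPE_SIZES = {0: 1, 1: 2, 2: 4, 3: 4, 4: 8, 5: 4, 6: 8}


def describe_segment(header):
    """Summarize a header given as a plain dict of its fields.

    Returns (type name, vector length, number of items in the segment).
    """
    elem_size = ASSUMED_FTYPE_SIZES[header["type"]]
    if header["cplx"]:
        elem_size *= 2  # a complex element stores real and imaginary parts
    vlen = header["size"] // elem_size
    nitems = header["bytes"] // header["size"]
    return FTYPE_NAMES[header["type"]], vlen, nitems
```

A header with type 5, 'cplx' true, an item size of 8, and 8000 bytes
of data would therefore describe 1000 complex float items with a
vector length of 1.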

\subsection extras Extras Information

The extras section is an optional segment of the header. If 'strt' ==
METADATA_HEADER_SIZE, then there are no extras. Otherwise, it is simply
a PMT dictionary of key:value pairs. The extras header can contain
anything and can grow while a program is running.

We can insert extra data into the header at the beginning if we
wish. All we need to do is use the pmt::pmt_dict_add function to insert
our hand-made metadata. This can be useful to add our own markers and
information.

The main role of the extras header, though, is as a container to hold
any stream tags. When a stream tag is observed coming in, the tag's
key and value are added to the dictionary. Like a standard dictionary,
any time a key already exists, the value will be updated. If the key
does not exist, a new entry is created and the new key:value pair is
added. So any new tags that the file metadata sink sees will
add to the dictionary. It is therefore important to always check the
'strt' value of the header to see if the length of the extras
dictionary has changed at all.
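
The update rule can be modeled with an ordinary Python dictionary.
This sketch does not use PMTs; the function name 'merge_tags' and the
tag keys in the usage note are hypothetical and only illustrate the
behavior described above.

```python
def merge_tags(extras, tags):
    """Return a copy of extras with each (key, value) tag applied in order."""
    merged = dict(extras)
    for key, value in tags:
        merged[key] = value  # overwrite an existing key or add a new one
    return merged
```

Starting from extras that hold only a "date" entry, merging two tags
with the same made-up key "rx_freq" and one with the key "gain"
leaves three entries, with "rx_freq" holding the most recent value.
The original dictionary is left untouched, which mirrors how
pmt::pmt_dict_add returns a new dictionary.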

When reading out data from the extras, we do not necessarily know the
data type of the PMT value. The key is always a PMT symbol, but the
value can be any other PMT type. There are PMT functions that allow us
to query the PMT to test if it is a particular type. We also have the
ability to do pmt::pmt_print on any PMT object to print it to
screen. Before converting from a PMT to its natural data type, it is
necessary to know the data type.


\section Utilities

GNU Radio comes with a couple of utilities to help in debugging and
manipulating metadata files. There is a general parser in Python that
will convert the PMT header and extra header into Python
dictionaries. This utility is:

- gr-blocks/python/parse_file_metadata.py

This program is installed into the Python directory under the
'gnuradio' module, so it can be accessed with:

\code
from gnuradio.blocks import parse_file_metadata
\endcode

It defines HEADER_LENGTH as the static length of the metadata header
size. It also has dictionaries that can be used to convert from the
file type to a string (ftype_to_string) and one to convert from the
file type to the size of the data type in bytes (ftype_to_size).

The 'parse_header' function takes in a PMT dictionary, parses it, and
returns a Python dictionary. An optional 'VERBOSE' bool can be set to
print the information to standard out.

The 'parse_extra_dict' function is similar in that it converts from a
PMT dictionary to a Python dictionary. The values are kept in their
PMT format since we do not necessarily know the native data type.

A program called 'gr_read_file_metadata' is installed into the path
and can be used to read out all header information from a metadata
file. This program is just called with the file name as the first
command-line argument. An option '-D' will handle detached header
files where the file of headers is expected to be the file name of the
data with '.hdr' appended to it.
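
The naming convention for detached headers is simple enough to
express in a few lines. The helper below is a hypothetical sketch,
not part of GNU Radio, that returns the data and header paths for
either mode.

```python
def metadata_paths(filename, detached=False):
    """Return (data_path, header_path) for a metadata file.

    For inline files the headers live in the data file itself, so
    both paths are the same. For detached files the header file is
    the data file name with '.hdr' appended.
    """
    if detached:
        return filename, filename + ".hdr"
    return filename, filename
```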


\section Examples

Examples are located in:

- gr-blocks/examples/metadata

Currently, there are a few GRC example programs.

- file_metadata_sink: create a metadata file from UHD samples.
- file_metadata_source: read the metadata file as input to a simple graph.
- file_metadata_vector_sink: create a metadata file from UHD samples.
- file_metadata_vector_source: read the metadata file as input to a simple graph.

The file sink example can be switched to use a signal source instead
of a UHD source, but no extra tagged data is used in this mode.

The file source example pushes the data stream to a new raw file while
a tag debugger block prints out any tags observed in the metadata
file. A QT GUI time sink is used to look at the signal as well.

The versions with 'vector' in the name are similar except they use
vectors of data.

The following shows a simple way of creating extra metadata for a
metadata file. This example is just showing how we can insert a date
into the metadata to keep track of later. The date in this case is
encoded as a vector of uint16 with [day, month, year].

\code
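  from gruel import pmt
  from gnuradio import gr, blocks

  # Build the extras dictionary: one entry, "date" -> [day, month, year].
  key = pmt.pmt_intern("date")
  val = pmt.pmt_init_u16vector(3, [13,12,2012])

  extras = pmt.pmt_make_dict()
  extras = pmt.pmt_dict_add(extras, key, val)
  extras_str = pmt.pmt_serialize_str(extras)

  # Pass the serialized extras to the sink. 'self' is the enclosing
  # flowgraph object and samp_rate is assumed to be defined already.
  self.sink = blocks.file_meta_sink(gr.sizeof_gr_complex,
                                    "/tmp/metadat_file.out",
                                    samp_rate, 1,
                                    blocks.GR_FILE_FLOAT, True,
                                    1000000, extras_str, False)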

\endcode

*/