Preamble

  • SignalML describes formats used for digital storage of (multivariate) biomedical time series.
  • SignalML is not a data format. Using a SignalML description, data is read from the original files, without any conversion or duplication of files.
  • SignalML does not have to provide a complete description of all the information contained in the data files, but it should describe the subset necessary and sufficient for a proper interpretation of the time series (e.g. for display).
  • As one of the extensions from version 1.0, we allow the possibility of storing information about one recording in more than one file.

Describing data series

The content of data files can be divided into two logical parts:

  1. The header containing meta-data—a description of the data contained in the file (sampling frequency, number of channels, electrode names, conversion constants, … and especially the format or physical layout of samples).
  2. The data part (raw numbers).

The header

Information that is normally contained in a header, either at the beginning of a single file or in a file separate from the main data file, and that is required to understand the bulk of the data, is converted into a series of ‘parameters’. The recipe for finding this information is given using param tags.

param tags must be children of a file tag. In case of parameters to be read from a file, this implicitly specifies the file.

Parameters come in a few flavours:

  • either taking arguments (functions) or not (variables)
  • either evaluating an expression or reading data from a file

Irrespective of the specific requirements described below for different flavours of parameters, for each parameter the following must be given:

  • name (the id attribute)
    This must be a valid identifier as specified in Identifiers. Identifiers of all parameters must be unique within one format description.
  • type (the type attribute)
    One of the types defined in Variable types.

The parameters are constant and idempotent, i.e. their evaluation always returns the same value (for the same arguments, in case of functions), and subsequent evaluation has no side effects.

Evaluating a parameter can require other parameters, because of references from attributes and expressions. The directed graph of such requirements must not contain cycles. In other words, an application using a format description can start with any parameter and find its value by evaluating other parameters as needed.
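
The on-demand evaluation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout, the evaluate helper and the bytes_per_sample parameter are hypothetical and not part of SignalML.

```python
def evaluate(name, definitions, cache=None, in_progress=None):
    """Evaluate a parameter on demand, evaluating its requirements first.

    `definitions` maps a parameter name to (requirements, compute), where
    `compute` receives a dict of the already evaluated requirements.
    This layout is an assumption made for the sketch.
    """
    cache = {} if cache is None else cache
    in_progress = set() if in_progress is None else in_progress
    if name in cache:
        return cache[name]          # parameters are constant, so caching is safe
    if name in in_progress:
        raise ValueError(f"cyclic requirement involving {name!r}")
    in_progress.add(name)
    requirements, compute = definitions[name]
    values = {r: evaluate(r, definitions, cache, in_progress)
              for r in requirements}
    in_progress.discard(name)
    cache[name] = compute(values)
    return cache[name]


# Hypothetical parameters: two constants and one derived value.
definitions = {
    "datatype_width": ([], lambda v: 2),
    "number_of_channels": ([], lambda v: 8),
    "bytes_per_sample": (
        ["datatype_width", "number_of_channels"],
        lambda v: v["datatype_width"] * v["number_of_channels"],
    ),
}

print(evaluate("bytes_per_sample", definitions))  # 16
```

Because parameters are constant, each value is computed at most once, and a cycle in the requirement graph is reported as an error instead of causing infinite recursion.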

Variables

Variables are parameters which have a constant value: it can depend only on other parameters (which are constant) and on data read from files, which is constant too.[1]

A parameter is a variable when it has no arguments defined.

Functions

Functions are parameters which require arguments for evaluation. Their value is constant for each combination of arguments. In evaluating the function, arguments behave like local parameters with the value passed in the function call.

A parameter is a function when it has at least one argument defined.

An argument is specified with a name and a type. Arguments are specified as arg child nodes of the param node defining the function. Each arg must have the following attributes:

  • name
  • type

Argument names must be unique within the function and they must be valid identifiers.

Data-reading parameters

When the value of a parameter should be read from a file, child tags specifying how to read this value must be given. What information is necessary depends on the type of file and is described in The data part.

Parameters of this type cannot contain the expr child.

Evaluating parameters

A parameter whose value is to be calculated from other parameters (or is a numerical constant) has a child <expr> node, whose contents are evaluated according to the rules in Expressions.

Standard parameters

Parameters listed below have a specified meaning. Some of them are prerequisites for interpreting the (multivariate) time series data. When there is no default value that can be assumed to be usually right, they are required to be present in each format description. They can be either evaluated from expressions or read from files.

It is preferable to describe the information present in each format as completely as possible, but this might not be feasible, and is not required for basic interpretation. Therefore, we define a minimum set of parameters:

number_of_channels [required]
The width of each sample, that is, number of time series (channels, derivations) recorded simultaneously.
Type: int
mapping(channel, sample) [required]
The mapping specifying the layout of data. See Mapping.
Type: int
sampling_frequency(channel) [optional]
The sampling frequency of the given channel.
Type: float
Units required
calibration_gain(channel) [optional]
Constant by which numbers from file are multiplied to get a physical value.
Type: float
Default value: 1
See #Calculating sample value.
calibration_offset(channel) [optional]
Constant which is subtracted from numbers read from the file to get a physical value.
Type: float
Default value: 0
See #Calculating sample value.
calibration_units(channel) [optional]
Physical units which this channel uses (usually μV or fT).
Type: string
This string is understood as described in #Parsing units.
samples_in_file(channel) [optional]
The length of the time series in samples, for the given channel. In case of most formats, the result does not depend on the channel number, and the function argument can be ignored. EDF is one of the rare formats where each channel can have a different number of samples, and the function argument is necessary.
If this parameter is not defined, the only way to know the number of samples is from file size.
Type: int
channel_name(channel) [optional]
Name of each channel.
Type: string
Default value: Lchannel where channel is substituted with the function argument.

Calculating sample value

Samples are often stored after a linear transformation. Therefore, two standard parameters are specified, which are then used to calculate the real value of a sample.


  final_value(channel, sample) =
    (sample_as_stored_in_file(channel,sample) - calibration_offset(channel)) 
    * calibration_gain(channel)

Neither of the two parameters has to be defined. They have default values of 0 (calibration_offset) and 1 (calibration_gain), which means that the above formula defaults to


  final_value(channel, sample) = sample_as_stored_in_file(channel,sample)
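
As a minimal sketch, the same calculation in Python, with the default values used when the calibration parameters are not defined. The function signature is an assumption made for illustration; in SignalML the gain and offset are functions of the channel.

```python
def final_value(stored, calibration_gain=1.0, calibration_offset=0.0):
    """Physical value of one sample, given the number stored in the file."""
    return (stored - calibration_offset) * calibration_gain


print(final_value(1200))                                  # 1200.0
print(final_value(1200, calibration_gain=0.5,
                  calibration_offset=200))                # 500.0
```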

The data part

File types

SignalML 2.0 can describe formats where data is stored in files of one of the following types:

  • Fixed position
  • XML
  • Free text

Fixed-position files

Files based on fields whose width is defined a priori, so that a field can be located by seek()ing in the file and read()ing from a known location, are called fixed-position files in this document.[2] They are also commonly called ‘binary’ files, but this is imprecise: e.g. EDF contains constant-width, fixed-position data formatted as ASCII strings, which is therefore not binary.

To retrieve a field present in a file of this type, the following data is necessary:

input format
This describes how the data is stored in the file, or more precisely, how many bytes are used, and how they should be interpreted.
This description is understood using the rules of a dtype definition, as defined in the NumPy array interface.
output type
Whatever is read, is converted to this type. It must be one of the types defined in #variable_types.
If output type is not specified, it defaults to the same generic type as the input format, albeit without explicit width for int and float types.
position in file
This tells where in the file this variable is located. It is the offset from the beginning of the file in bytes.

This file type is specified by <file type='binary'>.
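
A minimal Python sketch of reading one fixed-position field is given below. It uses only the standard library, so the NumPy-style format strings are translated to struct codes by a small table that covers just a few integer and float types. That table, the read_fixed name and the sample bytes are assumptions made for this illustration.

```python
import io
import struct

# Partial translation of dtype-style type codes to struct codes
# (assumption: only these few codes are handled in this sketch).
_STRUCT_CODES = {
    "i1": "b", "i2": "h", "i4": "i", "i8": "q",
    "u1": "B", "u2": "H", "u4": "I", "u8": "Q",
    "f4": "f", "f8": "d",
}


def read_fixed(stream, input_format, position, output_type=None):
    """Seek to `position` (bytes from the beginning) and read one field."""
    byte_order, code = input_format[0], input_format[1:]
    if byte_order not in ("<", ">"):
        raise ValueError("this sketch requires an explicit byte order")
    fmt = byte_order + _STRUCT_CODES[code]
    stream.seek(position)
    raw = stream.read(struct.calcsize(fmt))
    (value,) = struct.unpack(fmt, raw)
    # Convert to the output type only when one was requested.
    return value if output_type is None else output_type(value)


# Hypothetical file: 16 padding bytes, then a big-endian 16-bit integer.
data = bytes(16) + struct.pack(">h", 21)
print(read_fixed(io.BytesIO(data), ">i2", 16))         # 21
print(read_fixed(io.BytesIO(data), ">i2", 16, float))  # 21.0
```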

XML files

XML files are used more often for headers rather than data, but it is certainly possible to use XML for storage of both data parts. No validation is performed.

To retrieve a variable in an XML file the following information is necessary:

output type
The same as in fixed-position files.
location specified as an XPath
This XPath is used to retrieve a string value, which is in turn interpreted as a text representation of the output type.

This file type is specified by <file type='XML'>.
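
The lookup can be sketched with Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The element names in the sample header and the read_xml_param helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Converters from the text found in the file to the declared output type.
CONVERTERS = {"int": int, "float": float, "str": str}


def read_xml_param(xml_text, xpath, output_type):
    """Find the text selected by `xpath` and convert it to `output_type`."""
    root = ET.fromstring(xml_text)
    text = root.findtext(xpath)
    if text is None:
        raise LookupError(f"nothing matched {xpath!r}")
    return CONVERTERS[output_type](text.strip())


# Hypothetical XML header.
header = "<recording><channels>19</channels><rate>128.0</rate></recording>"
print(read_xml_param(header, "./channels", "int"))  # 19
print(read_xml_param(header, "./rate", "float"))    # 128.0
```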

Text files

Text files are composed of ‘lines’ separated by end-of-line markers. Lines are in turn divided into ‘fields’, and the position of the n-th sample can only be found by sequential parsing.

Because there are many different formats of text files, we do not define the precise format. Instead, the file is split into lines using the end-of-line marker, defined as a regular expression.

Each line can be split into fields using a separator regexp, defined as the split attribute on the file.

To retrieve a variable one of the following must be used:

Line number and field number
     
     <file split="/ +/">     (split at whitespace)
                             (extract third field on the first line)
     <param id="number_of_channels" line="1" field="3" />

Line number and a regexp to extract the variable value

The regexp must be written in such a way that its one and only capturing group matches the value.


                              (a string after the first colon in the first line)
     <param id="number_of_channels" line="1" match="/^[^:]*:(.+)$/" />

A regexp that matches against the whole file

The regexp must be written in such a way that its one and only capturing group matches the value.


                              (a line that starts with MSI.TotalChannels, part after colon)
     <param id="number_of_channels" line="any"
            match="/^MSI.TotalChannels:\s*(\d+)\s*$/" />

This file type is specified by <file type='text'>.
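
The three ways of retrieving a variable from a text file can be sketched in Python as below. The sketch assumes that lines and fields are counted from 1 (as the first example suggests), that lines end with the usual newline characters, and that the /…/ notation only delimits the pattern. The helper names and the sample text are hypothetical.

```python
import re


def bare(regexp):
    """Drop the surrounding slashes of the '/.../' notation."""
    if len(regexp) >= 2 and regexp.startswith("/") and regexp.endswith("/"):
        return regexp[1:-1]
    return regexp


def text_field(text, line, field, split):
    """Value found at a given line number and field number (both 1-based)."""
    fields = re.split(bare(split), text.splitlines()[line - 1])
    return fields[field - 1]


def text_match(text, match, line="any"):
    """Value captured by the single group of `match`.

    With line="any" the regexp is tried against the whole file,
    otherwise only against the given line.
    """
    if line == "any":
        found = re.search(bare(match), text, flags=re.MULTILINE)
    else:
        found = re.search(bare(match), text.splitlines()[int(line) - 1])
    if found is None:
        raise LookupError(f"{match!r} did not match")
    return found.group(1)


sample = "channels = 32\nMSI.TotalChannels: 148\n"
print(text_field(sample, 1, 3, "/ +/"))                            # 32
print(text_match(sample, r"/^MSI.TotalChannels:\s*(\d+)\s*$/"))    # 148
```

The returned values are strings; converting them to the declared type of the parameter is a separate step.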

Data tag and offset and file mapping

The presence of data in a file is signified by a <data> tag. There must be no more than one tag of this kind, but no data can be extracted from a format unless there is at least one. The file in which the data is contained can be specified either explicitly or implicitly.

If the element contains a file attribute, then this must be the name of a parameter giving the ID of the file to use. IDs of files are specified through the id attribute. It follows that if the id attribute was not given for a file, the file cannot be explicitly referenced in this way.

If the file attribute of the <data> element is not used, the <data> element must be nested inside a <file> element. The enclosing file is then implicitly taken to be the file containing the data.

The layout of data is specified by a function named in the offset attribute. The function must return the position (offset in bytes from the beginning of the file) of the requested measurement.

For example, for file called test.dat with multiplexed data, one could write

 <file name='test.dat'>
   <data offset='multiplex_mapping' />
   <param id='multiplex_mapping'> … </param>
 </file>

The parameters specified through the file and offset attributes must be functions taking two int arguments, specifying the channel and sample number, both starting from 0.

Example: multiplexed

Multiplexed samples are arranged as
(sample0:channel0 sample0:channel1 …
sample1:channel0 sample1:channel1 …
sampleN:channel0 sampleN:channel1 … sampleN:channelM)

The relevant mapping function is

<param id='mapping' type='int'>
  <arg name='channel' type='int' /> <!-- channel number -->
  <arg name='sample' type='int' /> <!-- sample number -->
  <expr>
     (sample * number_of_channels + channel) * datatype_width + header_size
  </expr>
</param>
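
For illustration, the same mapping as a Python function, with the parameters it depends on passed in explicitly. The concrete numbers in the usage line are made up.

```python
def multiplexed_offset(channel, sample, number_of_channels,
                       datatype_width, header_size):
    """Byte offset of one measurement in a multiplexed file."""
    return (sample * number_of_channels + channel) * datatype_width + header_size


# Hypothetical file: 4 channels, 2-byte samples, 256-byte header.
print(multiplexed_offset(channel=1, sample=3, number_of_channels=4,
                         datatype_width=2, header_size=256))  # 282
```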

Example: EDF

Samples are arranged in frames. To understand the layout, a helper parameter is used, channel_offset(channel), which specifies how far into each frame this channel's data is stored. Another helper parameter, frame_size, specifies the size of each frame in bytes.

The relevant mapping function is

 <param id='mapping' type='int'>
   <arg type='int' name='channel' />
   <arg type='int' name='sample' />
   <expr>
     sample//samples_per_frame(channel) * frame_size + 
     channel_offset(channel) + 
     sample%samples_per_frame(channel) * datatype_width
   </expr>
 </param>
 <param id='channel_offset' type='int'>
   <arg type='int' name='channel' />
   <expr>
     channel == 0 ? 0 :
     channel_offset(channel-1) + samples_per_frame(channel-1) * datatype_width
   </expr>
 </param>
 <param id='frame_size' type='int'>
   <expr>channel_offset(number_of_channels)</expr>
 </param>
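
A Python sketch of the frame-based layout follows. It assumes that channels are numbered from 0, that samples_per_frame is given as a list with one entry per channel, and that frame_size is the offset just past the last channel, i.e. the summed width of all channels in one frame. The make_edf_mapping helper and the numbers are hypothetical, and no header offset is added, as in the expressions above.

```python
from functools import lru_cache


def make_edf_mapping(samples_per_frame, datatype_width):
    """Build mapping(channel, sample) for a frame-based layout."""
    number_of_channels = len(samples_per_frame)

    @lru_cache(maxsize=None)
    def channel_offset(channel):
        # How far into each frame the data of this channel starts.
        if channel == 0:
            return 0
        return (channel_offset(channel - 1)
                + samples_per_frame[channel - 1] * datatype_width)

    # Offset just past the last channel: the size of one frame in bytes.
    frame_size = channel_offset(number_of_channels)

    def mapping(channel, sample):
        frame, index = divmod(sample, samples_per_frame[channel])
        return (frame * frame_size
                + channel_offset(channel)
                + index * datatype_width)

    return mapping, frame_size


# Hypothetical recording: channel 0 has 4 samples per frame, channel 1 has 2.
mapping, frame_size = make_edf_mapping([4, 2], datatype_width=2)
print(frame_size)     # 12
print(mapping(1, 3))  # 22
```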

Accessing files

Files are referenced through <file> elements.

Sometimes the file name is empty, i.e. a str of length 0, or simply not specified (in the case of files defined through <file> elements). The application can cope with this situation in two ways:

  1. If it is the ‘main’ file, then in the standard sequence of events the user specifies some filename to open, and this filename is used for the ‘main’ file.
  2. Otherwise, if the filename was not specified, the user can be queried or an error can be signaled.

Standard functions

The following functions are defined by the specification and are available in all SignalML implementations.

Exponential functions

log(x) returns the natural logarithm, log_e(x)
log10(x) returns the base-10 logarithm, log_10(x)
exp(x) returns e^x
factorial(x) returns x!

Trigonometric functions

The argument is interpreted as an angle in radians.

sin(x) returns sin x
cos(x) returns cos x
tan(x) returns tan x
cot(x) returns cot x

String functions

strip(s) returns the string with whitespace removed from the beginning and end. A character is considered whitespace if it is defined as such in Unicode.

split(s, sep) returns a list of words in the string s, using sep as the separator.

Special functions

protocol_version gives the SignalML version?

throw(message) is used to return an error to the application. The message is a string intended to be understood by the user that describes the error.

Variable types

Variable types are used for output from the codec to the surrounding application.

The following types are defined:

int a signed integral number
float a floating-point number
bool a boolean variable
str an array of Unicode characters
bytes an array of one-byte characters

These types are based on Python. However, they are not required to have unlimited range. It is at the implementation's discretion to use native integer or float types of sufficient range.

Additionally, arrays can be defined as int[], float[], etc. The length of the array is not defined at the time of declaration.

Those types are defined to describe how the application communicates with the codec at the logical level. However, the implementation defines what native types are used, and e.g. the data declared as float can be really present in memory as float, but the sampling frequency, also declared as float, can be stored in memory as a double float.

Expressions

Expressions are used in a number of places:

  1. In evaluating parameters — value of the parameter is found by executing the expression contained in an <expr> node.
  2. Some attributes are interpreted as expressions and executed.

The interpretation of an expression is roughly based on Python syntax and evaluation rules, including precedence.

Expressions can contain parameter references: variable references and function calls. A name used in an expression will, in order of precedence,

  1. refer to a local argument name (in a function),
  2. refer to a parameter,
  3. refer to a built-in parameter,
  4. cause a failure.

variable references

some_name

function calls

some_func(param1, param2)

constants

  • integral numbers (e.g. 123)
  • floating-point numbers (e.g. 23.12)
  • numbers with an explicit radix (e.g. 0x200, 0o755, 0b00110011)

operators

  • + (addition), - (subtraction), * (multiplication)
  • / (division) and // (integral division)[3]
  • % (modulo)[4]
  • ==, <, <=, >=, >, != (comparisons)
  • & (bitwise and), | (bitwise or), ^ (bitwise exclusive or), << (bitwise left shift), >> (bitwise right shift)
  • [start:stop:stride] (slicing)
  • predicate?if-true:if-false (ternary operator)
  • and, or, not, xor (logical operators)
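
Since expressions are roughly based on Python, a few of the constants and operators listed above can be checked directly in a Python interpreter. This is only an illustration of Python's own behaviour. The ternary operator is spelled differently in Python, and the examples dictionary is a device of this sketch.

```python
examples = {
    "7 / 2": 7 / 2,              # floating-point division
    "7 // 2": 7 // 2,            # integral division
    "-3 % 2": -3 % 2,            # modulo, not remainder
    "0x200": 0x200,              # hexadecimal constant
    "0o755": 0o755,              # octal constant
    "0b00110011": 0b00110011,    # binary constant
    "6 & 3": 6 & 3,              # bitwise and
    "1 << 4": 1 << 4,            # bitwise left shift
}

for text, value in examples.items():
    print(text, "=", value)

# predicate?if-true:if-false is written "if-true if predicate else if-false"
# in Python.
channel = 0
print(0 if channel == 0 else 8)  # 0
```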

identifiers

Parameter references and function calls are performed through identifiers. Identifiers must match the regexp /[a-zA-Z_][a-zA-Z_0-9]*/ in full, that is, be acceptable identifiers in Python, C, Java, …
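
A minimal check of this rule in Python, assuming that the whole name has to match the regexp. The is_valid_identifier helper is hypothetical.

```python
import re

IDENTIFIER = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")


def is_valid_identifier(name):
    """True if the whole of `name` matches the identifier regexp."""
    return IDENTIFIER.fullmatch(name) is not None


print(is_valid_identifier("number_of_channels"))   # True
print(is_valid_identifier("_sampling_frequency"))  # True
print(is_valid_identifier("2nd_channel"))          # False
```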

Example: multiplex

The offset of multiplexed channel sample in a binary file can be written as

(number_of_channels * sample_number + channel_number) * datatype_width

Units

To attach physical units to some parameter the attribute units= must be used.
E.g., to specify a sampling frequency of 100 Hz, one could use

 
    <param id='sampling_frequency' units='Hz'>
       <expr>100</expr>
    </param>
 

Each parameter is in one of the following states with respect to units:

  • undefined – an operation was performed which makes no sense when using units
  • with units – some unit is attached to the parameter
  • unitless – a special case of the above, the parameter is a scalar in units of 1.

Once a unit is set, it propagates according to the following rules (in order of importance)

  1. Explicitly setting units with the attribute overrides other rules.
  2. The result of a bit operation is unitless.
  3. The result of any operation with an operand in undefined unit state is in undefined unit state.
  4. The result of * or / is in the product or quotient of units of operands.
  5. The result of +, -, %, // is in the same units as either operand if they are in the same units, or undefined if they are not.
  6. The result of comparison operators is unitless if the operands are in the same units, or undefined if they are not.
  7. The result of slicing is in the same units as the slicee.
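
Rules 2 to 6 can be sketched in Python for binary operators. The representation is an assumption made for this illustration: units are a dictionary from unit symbol to exponent, the empty dictionary stands for unitless, and None stands for the undefined state. Rule 1 (explicit override) and rule 7 (slicing) are not covered.

```python
UNDEFINED = None  # the "undefined" unit state

BIT_OPERATORS = ("&", "|", "^", "<<", ">>")
COMPARISONS = ("==", "<", "<=", ">=", ">", "!=")


def combine(op, left, right):
    """Unit state of `left op right`, checked in order of importance."""
    if op in BIT_OPERATORS:                       # rule 2
        return {}
    if left is UNDEFINED or right is UNDEFINED:   # rule 3
        return UNDEFINED
    if op in ("*", "/"):                          # rule 4
        sign = 1 if op == "*" else -1
        result = dict(left)
        for symbol, power in right.items():
            result[symbol] = result.get(symbol, 0) + sign * power
        return {s: p for s, p in result.items() if p != 0}
    if op in ("+", "-", "%", "//"):               # rule 5
        return dict(left) if left == right else UNDEFINED
    if op in COMPARISONS:                         # rule 6
        return {} if left == right else UNDEFINED
    raise ValueError(f"unknown operator {op!r}")


volt, second = {"V": 1}, {"s": 1}
print(combine("/", volt, second))  # {'V': 1, 's': -1}
print(combine("+", volt, second))  # None
print(combine("/", volt, volt))    # {}
```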

Parsing units

The string specifying units must be written as a sequence of unit symbols separated by a space (for multiplication), a slash (for division) or a double star (**, for raising to a power). The precedence of those operators is the same as in expressions.

Base units and prefixes must be used as specified by the Bureau International des Poids et Mesures.

Greek letters in prefixes and other special letters must be written as-is, using an appropriate encoding (preferably some Unicode serialization) or entity references.

Units should be specified as a product or a quotient only where the unit has no commonly accepted symbol.

Examples:

  1. μV or mV or V for microvolt or millivolt or volt
  2. fT for femtotesla
  3. Å or nm or μm for angstrom or nanometer or micrometer
  4. m T/s for meter tesla per second (whatever that is)
  5. m**3 for cubic meters
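
A minimal Python sketch of parsing such strings into a dictionary from unit symbol to exponent is given below. It assumes integer exponents, keeps prefixed symbols such as μV as they are written, and applies a slash only to the symbol that directly follows it, which is what left-to-right evaluation of multiplication and division gives. The parse_units name is hypothetical.

```python
import re


def parse_units(text):
    """Return {symbol: exponent} for a unit string such as 'm T/s'."""
    units = {}
    for i, segment in enumerate(text.split("/")):
        # Glue "x ** 2" into "x**2" so that spaces only separate factors.
        factors = re.sub(r"\s*\*\*\s*", "**", segment).split()
        if not factors:
            raise ValueError(f"missing unit symbol in {text!r}")
        for j, factor in enumerate(factors):
            symbol, _, power = factor.partition("**")
            if not symbol:
                raise ValueError(f"missing unit symbol in {text!r}")
            exponent = int(power) if power else 1
            if i > 0 and j == 0:
                exponent = -exponent   # first symbol after a slash divides
            units[symbol] = units.get(symbol, 0) + exponent
    return {s: p for s, p in units.items() if p != 0}


print(parse_units("μV"))     # {'μV': 1}
print(parse_units("m T/s"))  # {'m': 1, 'T': 1, 's': -1}
print(parse_units("m**3"))   # {'m': 3}
```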

Verification

Verification of a SignalML format description can be performed on multiple levels:

  1. concordance with the corresponding XMLSchema.
    The canonical location of the schema is http://signalml.org/SignalML_2_0.xsd
  2. syntactic correctness of the algebraic expressions (contained within attributes, expr tags).
  3. conformance to other requirements set in this specification:
    1. all required parameters are defined
    2. expressions do not reference non-existent parameters
    3. parameters have required types
  4. successful execution of the actual data reading and evaluation operations (runtime correctness)
  5. fulfillment of the assertions specified in format description as asserts.

Points one, two and three depend only on the format description. A failure to satisfy points four or five, however, can be caused either by errors in the description or by data files not conforming to the description.

XML structure

<?xml version="1.0"?>
<format>
  <header>
      <format id='PE-EASYS'/>
  </header>

  <file extension='*.d' type='binary' >
     <param id='datatype_width'>
        <expr>4</expr>
     </param>
     <param id='mapping' type='int'>
        <arg name='channel_number' type='int' />
        <arg name='sample_number' type='int' />
        <expr>(number_of_channels * sample_number + channel_number) *
                  datatype_width + 16 * data_offset
        </expr>
     </param>

     <param id='magic'>
        <format>|S3</format>
        <offset>0</offset>
     </param>
     <assert id='magic_ok'>
         <expr>magic == "EAS"</expr>
     </assert>

     <param id='number_of_channels'>
         <format>>i1</format>
         <offset>16</offset>
     </param>

     <param id='sampling_frequency' type='float' units='Hz'>
         <expr>_sampling_frequency/100</expr>
     </param>
     <param id='_sampling_frequency'>
         <format>>i4</format>
         <offset>18</offset>
     </param>

     <param id='calibration_gain' type='float' units='μV'>
         <expr>_calibration_gain/100</expr>
     </param>
     <param id='_calibration_gain'>
         <format>>i1</format>
         <offset>25</offset>
     </param>

     <param id='data_offset'>
         <format>>i2</format>
         <offset>28</offset>
     </param>
  </file>
</format>

Allowed values of the type parameter (type of the file)

‚binary’ fixed-position with IEEE floating/fixed point
‚xml’ XML file
‚ascii’ an ASCII format

Format ID

It would be good to have a well-defined scheme for identifying and naming the format that is read by the codec, for example a given URI, as in XML Schemas. For EDF that might be “http://www.edfplus.info/specs/edf.html”. Some groups, organizations or companies might like to adhere to this kind of scheme, but even if they do not, the application might use this ID to create a list of handled formats and to group codecs for each of them.

For this, each codec should return some text info, useful for presenting to the user, e.g.:

  • company/organization promoting given data format
  • name of the format
  • version of the format

Codec ID

It would be also good to have a unique formula for the identification/description of a codec, including e.g.:

  • data format which it describes/defines (e.g. the URI from the above example)
  • unique name of the codec provider (something like the above URI, or Java package notation, in which case the EASYS codec that we create and maintain would be org.signalml.EASYS, with absolutely no relation to the Java class which implements it)
  • version number

The application should not allow simultaneous installation of two versions of the same codec (according to the above criteria); such an attempt should be processed as an upgrade.

Currently, the signalml application uses a hash of the XML file from which the codec was created for this purpose. That is a bit unnatural and, in general, a codec does not have to originate from an XML file to be functional.

Magic

From the point of view of the application and its users, it would be great to include some kind of “AI” that guesses and suggests the right format/codec based upon the file name and content. So, for the formats that include some kind of “magic” identification fields, we could include this information in the above description/ID.

References

P.J. Durka and D. Ircha (2004) SignalML: metaformat for description of biomedical time series. Computer Methods and Programs in Biomedicine, 76(3):253–259.

B. Kemp (2004) SignalML from an EDF+ perspective. Computer Methods and Programs in Biomedicine, 76(3):261–263.

Notes

  1. The case of files changing is not taken into account.
  2. This includes files with a header: the header is parsed to find the offset to the beginning of data, and subsequent access to data uses this fixed offset plus a relative offset calculated from the channel number and sample number.
  3. Note that / is normal floating-point division, that is as if a recent-enough Python was used (>=3.0), or from __future__ import division was executed in older versions.
  4. Note modulo is not the same as remainder, e.g. -3 % 2 == 1.