HDF5 smFRET format Draft 0.2 (WIP)
This document describes "Draft 0.2" of the HDF5-smFRET file format.
NOTE: This is currently a work-in-progress. Comments and suggestions are encouraged from all the interested parties.
This document contains the specifications for the HDF5-Ph-Data format. This format allows saving single-molecule spectroscopy experiments that produce at least one stream of photon timestamps. It is envisioned as a standard container format for a broad range of experiments involving confocal microscopy. Notable examples are confocal smFRET experiments (with or without laser alternation), with either a single or multiple excitation spots. It can also store ns-ALEX or FCS measurements.
- Assure the long term persistence of the data
- Space and speed efficient format for daily use
- Ease sharing of datasets and interoperability between analysis programs
- Open standard: a widely used format with open-source implementations (HDF5)
- Efficient: the HDF5 format is a binary format that allows compression and is fast to read and write
- Flexible: data arrays can be stored in "groups" (hierarchical format). Metadata can be attached to each data entry (attributes). No limit in data size. Support for a variety of numeric and non-numeric data types.
The main design principles we follow are:
- Simplicity
- Flexibility
- Compatibility
We aim at defining a format that is "small", easy to implement, efficient and expandable while maintaining compatibility.
To achieve "simplicity" we only require the general file layout and the presence of a few basic attributes and parameters. The remaining (small set of) fields defined here are present only when a particular measurement needs them.
We retain flexibility by allowing the user to save arbitrary data outside the specs of this document. To assure that a future version of this format will not clash with user-defined fields, we require that all user-defined fields be contained in groups named `user`.
The root node needs to include the following attributes:
- format_name = 'HDF5-Ph-Data',
- format_title = 'HDF5-based format for time-series of photon data.',
- format_version = '0.2'
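As a minimal sketch, the three required root attributes can be checked as below. The sketch assumes the attributes have already been read into a dict-like mapping of names to strings (with h5py, for instance, the root attributes are exposed through `File.attrs`). The names `REQUIRED_ROOT_ATTRS` and `check_root_attrs` are hypothetical and not part of the format.

```python
# Required root attributes as listed above (Draft 0.2).
REQUIRED_ROOT_ATTRS = {
    "format_name": "HDF5-Ph-Data",
    "format_title": "HDF5-based format for time-series of photon data.",
    "format_version": "0.2",
}


def check_root_attrs(attrs):
    """Return a list of problems found in a mapping of root attributes.

    `attrs` is any dict-like object. An empty list means that all three
    attributes are present with the expected values.
    """
    problems = []
    for name, expected in REQUIRED_ROOT_ATTRS.items():
        if name not in attrs:
            problems.append(f"missing attribute: {name}")
        elif attrs[name] != expected:
            problems.append(f"{name}: expected {expected!r}, got {attrs[name]!r}")
    return problems
```

A reader could call `check_root_attrs` right after opening a file and refuse to continue if the returned list is not empty.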
##Data fields
In the root group:
- `timestamps_unit`: (float) time in seconds of a 1-unit increment in timestamps. Normally, timestamps are integers and the unit increment is determined by the acquisition electronics. However, timestamps can also be floats expressing the time in seconds. In this case `timestamps_unit` is set to 1.
- `num_spots`: (integer) number of excitation or detection spots. It is 1 for single-spot measurements.
- `ALEX`: (boolean) if True, the measurement uses alternated excitation.
- `lifetime`: (boolean) if True, the data contains nanotimes from TCSPC hardware.
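To illustrate the role of `timestamps_unit`, here is a small sketch that converts raw timestamps to seconds. The function name and the 12.5 ns clock period are assumptions chosen for the example, not values prescribed by the format.

```python
def timestamps_to_seconds(timestamps, timestamps_unit):
    """Convert raw timestamps to seconds using the root-group unit."""
    return [t * timestamps_unit for t in timestamps]


# Integer timestamps from hypothetical electronics with a 12.5 ns clock.
raw = [0, 80, 160, 8000]
seconds = timestamps_to_seconds(raw, 12.5e-9)

# Float timestamps already expressed in seconds use timestamps_unit = 1.
already_seconds = timestamps_to_seconds([0.5, 1.25], 1)
```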
OPEN QUESTION: How to handle the case of 2 laser excitation and only 1 laser alternation?
The following parameters are mandatory for us-ALEX data. In the root group:
- `alternation_period`: (integer or float) the duration of the excitation alternation, in the same units as the timestamps. The alternation period in seconds is obtained by multiplying `alternation_period` by `timestamps_unit`. This field is present only for ALEX data.
The following parameters are optional for us-ALEX data. In the root group:
- `alex_donor_period`: (2-element array of ints) the start and stop values identifying the donor emission period. Used only in us-ALEX measurements.
- `alex_acceptor_period`: (2-element array of ints) the start and stop values identifying the acceptor emission period. Used only in us-ALEX measurements.
NOTE: The fields `donor_alex_on` and `acceptor_alex_on` allow selecting the photons detected during donor or acceptor excitation. As an example, let us define the array `A = timestamps MODULO alternation_period` and call the values in `donor_alex_on` and `acceptor_alex_on` (`start`, `stop`). A selection of photons emitted during the donor (acceptor) period is obtained by applying one of these two conditions:
- `A > donor_alex_on[0] and A < donor_alex_on[1]` when `donor_alex_on[0] < donor_alex_on[1]` (internal range)
- `A > donor_alex_on[0] or A < donor_alex_on[1]` when `donor_alex_on[0] > donor_alex_on[1]` (external range).
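The two conditions in the note above can be sketched in plain Python as follows. The function name `alex_selection` and the numeric values are illustrative assumptions. The case `start == stop` is not covered by the note, so this sketch rejects it.

```python
def alex_selection(timestamps, alternation_period, period_range):
    """Return a list of booleans, True for photons inside `period_range`.

    `period_range` is a (start, stop) pair in timestamp units.
    start < stop selects an internal range, start > stop an external
    (wrap-around) range.
    """
    start, stop = period_range
    if start == stop:
        raise ValueError("start and stop must differ")
    mask = []
    for t in timestamps:
        a = t % alternation_period  # A = timestamps MODULO alternation_period
        if start < stop:
            mask.append(a > start and a < stop)  # internal range
        else:
            mask.append(a > start or a < stop)   # external range
    return mask


# Illustrative values only: alternation period of 100 timestamp units.
timestamps = [5, 30, 70, 95, 130, 205]
donor_mask = alex_selection(timestamps, 100, (10, 50))     # internal range
acceptor_mask = alex_selection(timestamps, 100, (90, 10))  # external range
```

With array libraries the same logic is usually written as a vectorized boolean mask. The explicit loop is used here only to keep the sketch dependency-free.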
In the root group:
- `nanotimes_unit`: (float) time in seconds of the TCSPC bin. Note that, as opposed to timestamps, nanotimes are required to be integers (the raw values provided by the TCSPC board).
The "photon-data" is any "per-photon" piece of information. For example, timestamps or detector numbers are photon-data.
All the photon-data is contained in a group named `/photon_data`. This group contains different arrays, one for each type of data. All the arrays in this group have the same length (or, in general, the same number of rows), equal to the number of photons in the measurement.
The arrays in this group can only have names from a pre-defined set (corresponding to the specific quantities defined here). Any other photon-data not defined here can be saved as an array inside the `user` group (specifically `/photon_data/user/`). This requirement assures forward-compatibility of user-defined fields with future versions of this format.
Timestamps and corresponding detectors:
- `timestamps`: (array of int or float) contains all the recorded timestamps.
- `detectors`: (array of integers) contains the detector number for each timestamp in `timestamps`. Each physical detector (for example donor and acceptor channels) needs a unique label (a non-negative integer). For example, a measurement of smFRET and polarization anisotropy with a single donor-acceptor pair has 4 detectors and needs 4 different labels.
  NOTE: the `detectors` array is optional if and only if there is a single detector or only one detector per spot.
- `nanotime`: (array of int) contains the TCSPC nanotimes. This array is required only if `/lifetime` is True.
- `particles`: particle label (number) for each timestamp. This optional array is used when the data comes from a simulation that provides the particle information.
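The two constraints on `/photon_data` (pre-defined names, equal lengths) can be expressed as a small validation sketch. It assumes the arrays have been loaded into a mapping from name to sequence, and it covers only arrays, not sub-groups such as `user`. The set of names is taken from the list above, and `check_photon_data` is a hypothetical helper.

```python
# Array names defined for /photon_data in this draft.
PHOTON_DATA_NAMES = {"timestamps", "detectors", "nanotime", "particles"}


def check_photon_data(arrays):
    """Check a mapping {array name: sequence} against the /photon_data rules.

    Returns a list of problems; an empty list means both rules hold.
    """
    problems = []
    # Rule 1: only pre-defined names are allowed outside the user group.
    unknown = sorted(set(arrays) - PHOTON_DATA_NAMES)
    if unknown:
        problems.append(f"move to /photon_data/user/: {unknown}")
    # Rule 2: every array has one row per photon, so lengths must match.
    lengths = {name: len(values) for name, values in arrays.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"arrays differ in length: {lengths}")
    return problems
```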
Arrays in the `photon_data` group can have additional associated information that is not "photon-data" (i.e. is not an array with one element per photon). This data is stored in a group with a `_specs` suffix.
If information is available about which detectors are donor/acceptor and/or parallel/perpendicular polarization, the following arrays must be used. These fields are defined inside `detectors_specs`:
- `donor`: (array of ints) list of detectors for the donor channel. A standard smFRET measurement will have only one value. An smFRET measurement with polarization (4 detectors) will have 2 values. For a multi-spot measurement it will contain the list of donor-channel detectors. The order matters.
- `acceptor`: (array of ints) list of detectors for the acceptor channel. A standard smFRET measurement will have only one value. An smFRET measurement with polarization (4 detectors) will have 2 values. For a multi-spot measurement it will contain the list of acceptor-channel detectors. The order matters.
- `polariz_paral`: (array of ints) list of detectors for the parallel polarization.
- `polariz_perp`: (array of ints) list of detectors for the perpendicular polarization.
Additional specs can be saved in `detectors_specs/user/`.
NOTE 1: If only a single spectral channel is acquired, the detector(s) can be put in either `detectors_donor` or `detectors_acceptor`, but not in both. These arrays may be omitted when not relevant.
NOTE 2: If no polarization selection is performed in the detection path, the polarization fields should be omitted. If only one polarization is acquired, the detector number should go in either `polariz_paral` or `polariz_perp`, but not in both.
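As a sketch of how the `detectors_specs` arrays might be used, the following selects the photons of one channel from the `timestamps` and `detectors` arrays. The detector numbering, the assignment of detectors to channels and the helper name `photons_from_detectors` are all assumptions made for illustration.

```python
def photons_from_detectors(timestamps, detectors, detector_list):
    """Return the timestamps recorded by any detector in `detector_list`."""
    wanted = set(detector_list)
    return [t for t, d in zip(timestamps, detectors) if d in wanted]


# Hypothetical 4-detector measurement (smFRET with polarization).
timestamps = [10, 12, 15, 21, 30, 34]
detectors = [0, 2, 1, 3, 0, 1]
detectors_specs = {
    "donor": [0, 1],
    "acceptor": [2, 3],
    "polariz_paral": [0, 2],
    "polariz_perp": [1, 3],
}

donor_ts = photons_from_detectors(timestamps, detectors, detectors_specs["donor"])
acceptor_ts = photons_from_detectors(timestamps, detectors, detectors_specs["acceptor"])

# Detectors that are both donor and parallel polarization.
donor_paral = sorted(set(detectors_specs["donor"]) & set(detectors_specs["polariz_paral"]))
```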
When the `nanotimes` array is present, the following specs are also required:
- `tcspc_bin`: (float) TCSPC bin-size in seconds. The same as `/nanotime_unit`.
- `tcspc_nbins`: (int) TCSPC number of bins.
- `tcspc_range`: (float) full-scale range of the TCSPC hardware.
QUESTION 1: Should we keep both `tcspc_bin` and `/nanotime_unit` even though they contain the same number?
QUESTION 2: In principle `tcspc_range` is `tcspc_bin*tcspc_nbins`. Is it redundant?
TENTATIVE ANSWER: In both cases we are talking about a single float. It may be more convenient to keep this minimal redundancy, which eases reading the data.
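If the redundancy is kept, a reader can use it as a sanity check. The sketch below compares `tcspc_range` with `tcspc_bin*tcspc_nbins` up to floating-point tolerance. It assumes `tcspc_range` is expressed in seconds like `tcspc_bin`; the helper name and the numeric values (16 ps bins, 4096 bins) are illustrative only.

```python
import math


def tcspc_specs_consistent(tcspc_bin, tcspc_nbins, tcspc_range, rel_tol=1e-9):
    """True if tcspc_range matches tcspc_bin * tcspc_nbins within rel_tol."""
    return math.isclose(tcspc_bin * tcspc_nbins, tcspc_range, rel_tol=rel_tol)


# Hypothetical hardware: 4096 bins of 16 ps give a 65.536 ns full-scale range.
ok = tcspc_specs_consistent(16e-12, 4096, 65.536e-9)
mismatch = tcspc_specs_consistent(16e-12, 4096, 50e-9)
```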
Optionally, if data comes from simulations it can contain the following specs:
- `tau_accept_only`: (float) intrinsic Acceptor lifetime in seconds.
- `tau_donor_only`: (float) intrinsic Donor lifetime.
- `tau_fret_donor`: (float) Donor lifetime in presence of Acceptor.
- `tau_fret_trans`: (float) FRET energy transfer lifetime. Inverse of the rate of `D*A -> DA*`.
Additional specs can be saved in `nanotime_specs/user/`.
Multi-spot measurements can be saved using the "basic layout" described in the previous section. In this case the `timestamps` array contains all the timestamps from all the channels, and the `detectors` array discriminates between the pixels in the detector array. In the case of smFRET measurements, the `detectors_specs` arrays `donor` and `acceptor` each contain an ordered list of detector numbers whose length is the number of spots.
However, reading multi-spot data from a basic layout is inefficient, because extracting the photon-data of a single channel requires reading the whole `timestamps` and `detectors` arrays. For this reason, for multi-spot data we define an additional layout called "multi-spot layout".
The "multi-spot layout" is identical to the basic layout for single-spot data. The only difference is that instead of a single group `/photon_data` there are N groups `/photon_data_0` .. `/photon_data_N-1`, one for each spot. Each group has a suffix indicating the spot number (starting from 0).
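The relation between the two layouts can be sketched as a conversion from basic-layout arrays to per-spot groups. The sketch assumes that entry `i` of the ordered `donor` and `acceptor` lists refers to spot `i`. The function `split_by_spot` is hypothetical, and plain dictionaries stand in for HDF5 groups.

```python
def split_by_spot(timestamps, detectors, donor, acceptor):
    """Split basic-layout photon data into per-spot groups.

    donor[i] and acceptor[i] are the detector numbers of spot i.
    Returns {group name: {"timestamps": [...], "detectors": [...]}}.
    """
    groups = {}
    for spot, (d, a) in enumerate(zip(donor, acceptor)):
        # Indices of the photons recorded by either detector of this spot.
        keep = [i for i, det in enumerate(detectors) if det in (d, a)]
        groups[f"/photon_data_{spot}"] = {
            "timestamps": [timestamps[i] for i in keep],
            "detectors": [detectors[i] for i in keep],
        }
    return groups


# Illustrative 2-spot measurement with 4 detectors.
groups = split_by_spot(
    timestamps=[1, 2, 3, 4, 5, 6],
    detectors=[0, 2, 1, 3, 0, 3],
    donor=[0, 1],
    acceptor=[2, 3],
)
```

Note that the whole `detectors` array is scanned once per spot, which is the cost the multi-spot layout avoids at read time.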
The HDF5-Ph-Data format defines an optional "sample" section where information about the measured sample can be stored. This data is stored in the group `/sample_specs`.
Within `/sample_specs` the following fields are defined:
- `num_dyes`: (int) number of different dyes present in the sample. For a standard single-pair FRET measurement the value is 2. For donor-only or acceptor-only measurements the value should be 1.
- `dye_names`: (list of strings) list of dye names (for example `['ATTO550', 'ATTO647N']`).
- `buffer_name`: (string) free-form description of the sample buffer. For example `'TE50 + 1mM of TROLOX'`.
- `sample_name`: (string) free-form description of the sample. For example `'40-bp dsDNA, D-A distance: 7-bp'`.
The optional group `/setup_specs` contains fields describing the measurement setup:
- `excitation_wavelengths`: (array of floats) array of all the excitation wavelengths in S.I. units (meters).
- `excitation_powers`: (array of floats) array of excitation powers (in the same order as `excitation_wavelengths`). The powers are expressed in S.I. units (Watts).
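Since wavelengths and powers are commonly quoted in nm and mW, a writer has to convert them to S.I. units before saving. A minimal sketch follows; the helper names and laser values are assumptions made for illustration.

```python
def nm_to_m(wavelength_nm):
    """Convert a wavelength from nanometers to meters (S.I.)."""
    return wavelength_nm * 1e-9


def mw_to_w(power_mw):
    """Convert a power from milliwatts to watts (S.I.)."""
    return power_mw * 1e-3


# Illustrative values only: two lasers, listed in the same order in both arrays.
setup_specs = {
    "excitation_wavelengths": [nm_to_m(532), nm_to_m(635)],
    "excitation_powers": [mw_to_w(0.1), mw_to_w(0.05)],
}

# Both arrays describe the same lasers, so their lengths must match.
same_length = len(setup_specs["excitation_wavelengths"]) == len(
    setup_specs["excitation_powers"]
)
```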
Unlimited user-defined fields are allowed. To make sure that future versions of this format will not use any user-defined field names, all the custom data should be contained in a group named `user`. A `user` group can be placed anywhere in the HDF5 hierarchy and should be placed wherever is most logical for the kind of data stored. As an example, user data can be stored in `/user`, `/photon_data/user`, `/photon_data/nanotimes_specs/user`, `/setup_specs/user`.
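A reader or validator may want to skip everything stored under a `user` group. One possible check, sketched here on HDF5 path strings, is whether any path component is exactly `user`. The helper name `is_user_path` is hypothetical.

```python
def is_user_path(path):
    """True if any component of an HDF5 path is a group named 'user'."""
    return "user" in path.strip("/").split("/")
```

Matching whole path components, rather than substrings, avoids treating a name such as `username` as user-defined data.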