HDF5 smFRET format Draft 0.2 (WIP)

This document describes "Draft 0.2" of the HDF5-smFRET file format.

NOTE: This is a work in progress. Comments and suggestions from all interested parties are encouraged.

Introduction

Overview

This document contains the specifications for the HDF5-Ph-Data format. This format can store any single-molecule spectroscopy experiment that produces at least one stream of photon timestamps. It is envisioned as a standard container format for a broad range of experiments involving confocal microscopy. Notable examples are confocal smFRET experiments (with or without laser alternation), with either a single or multiple excitation spots. It can also store ns-ALEX or FCS measurements.

What problems are we trying to solve?

Ensure the long-term persistence of the data
Provide a space- and speed-efficient format for daily use
Ease the sharing of datasets and interoperability between analysis programs

Features of HDF5

Open: HDF5 is an open standard and a widely used format with open-source implementations
Efficient: the HDF5 format is a binary format that allows compression and is fast to read and write
Flexible: data arrays can be stored in "groups" (hierarchical format). Metadata can be attached to each data entry (attributes). There is no limit on data size. A variety of numeric and non-numeric data types are supported.

HDF5-Ph-Data: Design principles

The main design principles we follow are

Simplicity
Flexibility
Compatibility

We aim at defining a format that is "small", easy to implement, efficient and expandable while maintaining compatibility.

To achieve "simplicity" we only require the general file layout and the presence of a few basic attributes and parameters. The remaining (small set of) fields defined here are present only when a particular measurement needs them.

We retain flexibility by allowing the user to save any arbitrary data outside the specs of this document. To ensure that a future version of this format will not clash with user-defined fields, we require that all user-defined fields be contained in groups named user.

HDF5-Ph-Data format definition

Metadata

The root node needs to include the following attributes:

format_name = 'HDF5-Ph-Data',
format_title = 'HDF5-based format for time-series of photon data.',
format_version = '0.2'
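
As an illustration, the three root attributes can be checked with a few lines of Python. This is only a sketch: the function name check_root_attrs is hypothetical, and it accepts any mapping of attribute names to values (a plain dict here, though a mapping-like attribute object from an HDF5 library should work the same way).

```python
# Attribute values required on the root node by this draft.
REQUIRED_ROOT_ATTRS = {
    "format_name": "HDF5-Ph-Data",
    "format_title": "HDF5-based format for time-series of photon data.",
    "format_version": "0.2",
}


def check_root_attrs(attrs):
    """Return a list of problems found in the root-node attributes.

    `attrs` is any mapping of attribute name -> value (hypothetical helper,
    not part of the format).
    """
    problems = []
    for name, expected in REQUIRED_ROOT_ATTRS.items():
        if name not in attrs:
            problems.append(f"missing attribute: {name}")
            continue
        value = attrs[name]
        # Depending on the HDF5 library, strings may be read back as bytes.
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        if value != expected:
            problems.append(f"{name}: expected {expected!r}, got {value!r}")
    return problems


print(check_root_attrs(dict(REQUIRED_ROOT_ATTRS)))
```

An empty list means the metadata matches this draft.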

Data fields

Required parameters:

In the root group:

timestamps_unit: (float) time in seconds of a 1-unit increment in timestamps. Normally, timestamps are integers and the unit increment is determined by the acquisition electronics. However, timestamps can also be floats that express the time in seconds. In this case timestamps_unit is set to 1.
number_confocal_spots: (integer) Normally it is 1 for single-spot measurements. In multi-spot measurements it contains the number of excitation spots.
ALEX: (boolean) if True, the measurement uses alternating excitation.
lifetime: (boolean) if True, the data contains nanotimes from TCSPC hardware.
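
The role of timestamps_unit can be sketched as follows. The helper name and the 12.5 ns clock period are assumptions chosen for illustration; the actual unit depends on the acquisition electronics.

```python
def timestamps_to_seconds(timestamps, timestamps_unit):
    """Convert raw timestamps to seconds (hypothetical helper)."""
    return [t * timestamps_unit for t in timestamps]


# Integer timestamps counted in clock ticks; 12.5 ns per tick is an assumed value.
raw_timestamps = [0, 80, 4000]
seconds = timestamps_to_seconds(raw_timestamps, 12.5e-9)

# Float timestamps already in seconds use timestamps_unit = 1 and are unchanged.
float_seconds = timestamps_to_seconds([0.5, 1.25], 1)
```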

OPEN QUESTION: How to handle the case of 2 laser excitation and only 1 laser alternation?

Required parameters when `ALEX`

In the root group:

alternation_period: (integer or float) the duration of the excitation alternation period, in the same units as the timestamps. The alternation period in seconds is obtained by multiplying alternation_period by timestamps_unit. This field is present only for ALEX data.
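
A short sketch of how alternation_period might be used. The conversion to seconds follows the definition above; folding the timestamps with a modulo, assuming the alternation starts at timestamp 0, is a common analysis step but is an assumption here and not something this format defines. All names and numbers are illustrative.

```python
def alternation_period_seconds(alternation_period, timestamps_unit):
    """Alternation period in seconds, as defined by the format."""
    return alternation_period * timestamps_unit


def alternation_phase(timestamps, alternation_period):
    """Position of each photon inside its alternation period, in timestamp units.

    Assumes the alternation starts at timestamp 0 (not specified by the format).
    """
    return [t % alternation_period for t in timestamps]


period_s = alternation_period_seconds(4000, 12.5e-9)  # assumed example values
phases = alternation_phase([0, 4100, 7999], 4000)
```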

Required parameters when `lifetime`

In the root group:

nanotimes_unit: (float) time in seconds of the TCSPC bin. Note that, as opposed to timestamps, nanotimes are required to be integers (the raw values provided by the TCSPC board).

Basic layout for "photon-data"

The "photon-data" is any "per-photon" piece of information. For example, timestamps or detector numbers are photon-data.

All the photon-data is contained in a group named /photon_data. This group contains different arrays, one for each type of data. All the arrays in this group have the same length (or, in general, the same number of rows) equal to the number of photons in a measurement.

The arrays in this group can only have names from a pre-defined set (corresponding to the specific quantities defined here). Any other photon-data that is not defined here can be saved as an array inside the user group (specifically '/photon_data/user/'). This requirement ensures forward compatibility of user-defined fields with future versions of this format.

Required Photon-data arrays

Timestamps and corresponding detectors:

timestamps: (array of int or float) contains all the recorded timestamps
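
A minimal sketch of writing the required metadata, root parameters and timestamps array with the h5py library. Several details are assumptions made for illustration and are not prescribed by this draft: the root parameters are stored as scalar datasets, the timestamps_unit value is 12.5 ns, and the file lives only in memory (driver="core" with backing_store=False) so that nothing is written to disk.

```python
import h5py

# In-memory HDF5 file; the file name is a placeholder and no file is created.
with h5py.File("minimal_example.h5", "w", driver="core", backing_store=False) as f:
    # Root-node attributes required by the Metadata section.
    f.attrs["format_name"] = "HDF5-Ph-Data"
    f.attrs["format_title"] = "HDF5-based format for time-series of photon data."
    f.attrs["format_version"] = "0.2"

    # Required root parameters, stored here as scalar datasets (an assumption).
    f.create_dataset("timestamps_unit", data=12.5e-9)
    f.create_dataset("number_confocal_spots", data=1)
    f.create_dataset("ALEX", data=False)
    f.create_dataset("lifetime", data=False)

    # The only required photon-data array.
    photon_data = f.create_group("photon_data")
    photon_data.create_dataset("timestamps", data=[0, 80, 4000], dtype="int64")

    # Read a few values back while the file is still open.
    n_photons = f["photon_data/timestamps"].shape[0]
    unit_read = f["timestamps_unit"][()]
    alex_read = bool(f["ALEX"][()])
```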

Optional photon-data arrays

detectors: (array of integers) contains the detector number for each timestamp in timestamps. Each physical detector (for example donor and acceptor channels) needs to have a unique label (a non-negative integer). For example, a measurement of smFRET and polarization anisotropy with a single donor-acceptor pair uses 4 detectors and therefore needs 4 different labels.

NOTE: the detectors array may be omitted only if there is a single detector, or only one detector per spot.

nanotime: (array of int) contains the TCSPC nanotimes. This array is only required if /lifetime is True.
particles: particle label (number) for each timestamp. This optional array is used when the data comes from a simulation that provides the particle information.

Photon-data "specs"

Arrays in the photon_data group can have additional associated information that is not "photon-data" (i.e. is not an array with one element per photon). This data is stored in a group with a _specs suffix.
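
The layout rules for /photon_data (pre-defined names, equal lengths, user data under user, extra information under a _specs group) can be sketched as a small check. The function operates on a plain mapping of names to sequences, with nested mappings standing in for groups; the helper name and this in-memory representation are assumptions for illustration.

```python
# Array names defined by this draft for the /photon_data group.
DEFINED_ARRAYS = {"timestamps", "detectors", "nanotime", "particles"}


def check_photon_data(photon_data):
    """Return a list of layout problems for a /photon_data-like mapping."""
    problems = []
    if "timestamps" not in photon_data:
        problems.append("missing required array: timestamps")
    lengths = {}
    for name, value in photon_data.items():
        # 'user' and '*_specs' entries are groups, not per-photon arrays.
        if name == "user" or name.endswith("_specs"):
            continue
        if name not in DEFINED_ARRAYS:
            problems.append(f"undefined name {name!r}: move it to /photon_data/user/")
            continue
        lengths[name] = len(value)
    # All per-photon arrays must have one entry per photon.
    if len(set(lengths.values())) > 1:
        problems.append(f"arrays differ in length: {lengths}")
    return problems


good = {
    "timestamps": [10, 20, 30],
    "detectors": [0, 1, 0],
    "detectors_specs": {"donor": [0], "acceptor": [1]},
    "user": {"burst_id": [1, 1, 2]},
}
print(check_photon_data(good))
```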

Detectors "specs"

If information is available about which detectors are donor/acceptor and/or parallel/perpendicular polarization, the following arrays must be used:

detectors_specs/donor: (array of ints) list of detectors for the donor channel. A standard smFRET measurement will have only one value. An smFRET measurement with polarization (4 detectors) will have 2 values. For a multi-spot measurement it will contain the list of donor-channel detectors. The order matters.
detectors_specs/acceptor: (array of ints) list of detectors for the acceptor channel. A standard smFRET measurement will have only one value. An smFRET measurement with polarization (4 detectors) will have 2 values. For a multi-spot measurement it will contain the list of acceptor-channel detectors. The order matters.
detectors_specs/polariz_paral: (array of ints) list of detectors for the parallel polarization.
detectors_specs/polariz_perp: (array of ints) list of detectors for the perpendicular polarization.

Additional specs can be saved in detectors_specs/user/.

NOTE 1: If only a single spectral channel is acquired, the detector(s) can be put in either detectors_specs/donor or detectors_specs/acceptor but not in both. These arrays may be omitted when not relevant.

NOTE 2: If no polarization information is acquired these fields should be empty, or they can be omitted.
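
A sketch of how the detector specs might be used to classify photons. Both helper names are hypothetical, and plain lists stand in for the detectors array and the detectors_specs arrays.

```python
def acceptor_mask(detectors, acceptor_detectors):
    """True for each photon recorded by a detector listed in detectors_specs/acceptor."""
    acceptor = set(acceptor_detectors)
    return [d in acceptor for d in detectors]


def detectors_in_both(donor, acceptor):
    """Detector labels listed as both donor and acceptor (should be empty)."""
    return sorted(set(donor) & set(acceptor))


detectors = [0, 1, 1, 0, 1]       # one detector label per photon
mask = acceptor_mask(detectors, acceptor_detectors=[1])
overlap = detectors_in_both(donor=[0], acceptor=[1])
```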

Nanotime "specs"

When the nanotime array is present, the following specs must also be provided:

tcspc_bin: (float) TCSPC bin-size in seconds. The same as /nanotime_unit.
tcspc_nbins: (int) TCSPC number of bins.
tcspc_range: (float) Full-scale range of the TCSPC hardware.

QUESTION 1 Should we keep both tcspc_bin and /nanotime_unit even though they contain the same number?

QUESTION 2 In principle tcspc_range is tcspc_bin*tcspc_nbins. Is it redundant?

TENTATIVE ANSWER: In both cases we are talking about a single float. It may just be more convenient to keep this minimal redundancy, since it eases reading the data.
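
If the redundant fields are kept, a reader can at least verify that they agree. The sketch below assumes that tcspc_range is expressed in seconds (so that it equals tcspc_bin*tcspc_nbins); the helper names and the numeric values are illustrative only.

```python
import math


def check_tcspc_specs(tcspc_bin, tcspc_nbins, tcspc_range, nanotime_unit=None):
    """Return a list of inconsistencies among the redundant TCSPC fields."""
    problems = []
    # Assumes tcspc_range is in seconds, like tcspc_bin.
    if not math.isclose(tcspc_range, tcspc_bin * tcspc_nbins, rel_tol=1e-9):
        problems.append("tcspc_range != tcspc_bin * tcspc_nbins")
    if nanotime_unit is not None and not math.isclose(
        tcspc_bin, nanotime_unit, rel_tol=1e-9
    ):
        problems.append("tcspc_bin differs from the root nanotime unit")
    return problems


def nanotimes_to_seconds(nanotimes, tcspc_bin):
    """Convert integer TCSPC nanotimes to seconds."""
    return [n * tcspc_bin for n in nanotimes]


# Assumed example: 16 ps bins, 4096 bins, 65.536 ns full-scale range.
issues = check_tcspc_specs(16e-12, 4096, 65.536e-9, nanotime_unit=16e-12)
delays = nanotimes_to_seconds([0, 125, 4095], 16e-12)
```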

Optionally, if data comes from simulations it can contain the following specs:

tau_accept_only: (float) Intrinsic Acceptor lifetime in seconds.
tau_donor_only: (float) Intrinsic Donor lifetime.
tau_fret_donor: (float) Donor lifetime in presence of Acceptor.
tau_fret_trans: (float) FRET energy transfer lifetime. Inverse of the rate of D*A -> DA*.

Additional specs can be saved in nanotime_specs/user/.

Timestamps and detector: multi-spot layout (TO BE UPDATED)

In multi-spot measurements the basic layout can be used. However, to reduce RAM requirements and speed up reading, it is convenient to store timestamps in different arrays, one for each spot.

In this case we have a group /timestamps that contains a series of arrays:

ts_0, ts_1, ... ts_(N-1) (where N is the number of spots)

Each array contains all the timestamps (donor + acceptor) for the given spot.

The information about acquisition channel (i.e. donor or acceptor) for each timestamp is stored in a boolean mask for the acceptor channel (a timestamp is from the acceptor channel if the boolean is True). These boolean masks are a series of arrays in the group /acceptor_mask:

A_mask_0, A_mask_1, ... A_mask_(N-1) (where N is the number of spots)

As with the /timestamps group, there is one array per excitation spot. Each array in /acceptor_mask is the boolean mask for the corresponding array in /timestamps (A_mask_0 -> ts_0, etc.).
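
The relation between the basic layout and the multi-spot layout can be sketched as a conversion. The sketch assumes that the i-th entries of the donor and acceptor detector lists belong to spot i; this pairing is an assumption, as are the function name and the use of plain dicts in place of the /timestamps and /acceptor_mask groups.

```python
def to_multispot(timestamps, detectors, donor, acceptor):
    """Split basic-layout arrays into per-spot timestamps and acceptor masks.

    donor[i] and acceptor[i] are assumed to be the two detectors of spot i.
    Returns two dicts standing in for the /timestamps and /acceptor_mask groups.
    """
    ts_group, mask_group = {}, {}
    for spot, (d_det, a_det) in enumerate(zip(donor, acceptor)):
        # Keep the photons of this spot, remembering which came from the acceptor.
        selected = [
            (t, det == a_det)
            for t, det in zip(timestamps, detectors)
            if det in (d_det, a_det)
        ]
        ts_group[f"ts_{spot}"] = [t for t, _ in selected]
        mask_group[f"A_mask_{spot}"] = [is_acc for _, is_acc in selected]
    return ts_group, mask_group


ts_group, mask_group = to_multispot(
    timestamps=[10, 20, 30, 40, 50],
    detectors=[0, 2, 1, 3, 0],
    donor=[0, 1],      # donor detectors of spots 0 and 1 (example values)
    acceptor=[2, 3],   # acceptor detectors of spots 0 and 1 (example values)
)
```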

When using the "multi-spot layout" (which in principle can also be used for single-spot data), the following fields specific to the "basic layout" should not be present:

timestamps_t
detectors_t
detectors_donor
detectors_acceptor
detectors_parallel_polarization
detectors_perpendicular_polarization

Sample fields

The group /sample_parameters contains the following fields describing the sample:

number_of_dyes: (int) number of different dyes present in the sample. For a standard single-pair FRET measurement the value is 2. For donor-only or acceptor-only measurements the value should be 1. Values larger than 2 are allowed but not currently covered in this document.
donor_dye: (string) name of the donor dye, or empty string if no donor dye is present.
acceptor_dye: (string) name of the acceptor dye, or empty string if no acceptor dye is present.
buffer: (string) free-form description of the sample buffer. For example 'TE50 + 1mM of TROLOX'.

Measurement setup fields

The group /setup_parameters contains the following fields describing the measurement setup:

excitation_wavelength_donor: (float) excitation wavelength in S.I. units (meters) for the donor dye.
excitation_wavelength_acceptor: (float) excitation wavelength in S.I. units (meters) for the acceptor dye.

Optional fields (they may not exist):

excitation_power_donor: (float) excitation power in S.I. units (W) for the donor dye.
excitation_power_acceptor: (float) excitation power in S.I. units (W) for the acceptor dye.
detector_type: (table) the first column is an integer containing the detector label. The second column is a 128-char string with the detector name. For example 'MPD red-enhanced gen. 1'.

Additional fields

Any additional user-defined fields should be allowed. To make sure that future versions can introduce new names without conflicting with user-defined fields, all custom data should be contained in a specific group, named for example user_data.
