Skip to content

Latest commit

 

History

History
239 lines (190 loc) · 6.61 KB

SPECIFICATIONS.rst

File metadata and controls

239 lines (190 loc) · 6.61 KB

Specifications

Fastq sequence description

Fields in fastq description:

Key Description
instrument Instrument ID
run_number Run number on instrument.
flowcell_ids Flowcell Identifier
flowcell_ids Flowcell IDS
lane Lane number
tile Tile number
x_pos Position X of cluster
y_pos Position Y of cluster
UMI Optional, appears when UMI is specified in sample sheet. UMI sequences for Read 1 and Read 2, seperated by a plus [+]
read Read number - 1 can be single read or Read 2 of paired-end
is_filtered Y if the read is filtered (did not pass), N otherwise
control_number 0 when none of the control bits are on, otherwise it is an even number. On HiSeq X and NextSeq systems, control specification is not performed and this number is always 0.
index Index of the read

See also https://help.basespace.illumina.com/files-used-by-basespace/fastq-files

Filter file

The filter files can be found in the BaseCalls directory. The filter file specifies whether a cluster passed filters. Filter files are generated at cycle 26 using 25 cycles of data. For each tile, one filter file is generated. Location: Data/Intensities/BaseCalls/L001 File format: s_[lane]_[tile].filter

The format is described below

Bytes Description
0-3 Zero value (for backwards compatibility)
4-7 Filter format version number
8-11 Number of clusters
12-(N+11) Where N is the cluster number. unsigned 8-bits integer Bit 0 is pass or failed filter

Filter bytes example:

bytes([0, 0, 0, 0]) # prefix 0
bytes([3, 0, 0, 0]) # version 3
struct.pack("<I", cluster_count) # number of cluster in little endian unsigned int
bytes([1]*cluster_count) # For each cluster an unsigned 8-bits integer Where Bit 0 is pass or failed filter

1 == PASS FILTER
0 == NO PASS FILTER

In hexdump:

BYTES 0-3      BYTES 4-7      BYTES 8-11     BYTES 12-14
00 00 00 00    03 00 00 00    03 00 00 00    01 01 01

At bytes 8-11 I have 3 clusters and each cluster is represented by a an unsigned 8-bit integer.

Control file

The control files are binary files containing control results.

Bytes Description
0-3 Zero value (for backwards compatibility)
4-7 Format version number
12-(2xN+11)
Where N is the cluster number
  • Bit 0: always empty (0)
  • Bit 1: was the read identified as a control?
  • Bit 2: was the match ambiguous?
  • Bit 3: did the read match the phiX tag?
  • Bit 4: did the read align to match the phiX tag?
  • Bit 5: did the read match the control index sequence?
  • Bits 6,7: reserved for future use
  • Bits 8..15: the report key for the matched record in the controls.fasta file (specified by the REPORT_KEY metadata)

Locations file

The BCL to FASTQ converter can use different types of position files and will expect a type based on the version of RTA used The locs files can be found in the Intensities/L<lane> directories

Bcl file

The BCL files can be found in the BaseCalls directory inside the run directory: Data/Intensities/BaseCalls/L<lane>/C<cycle>.1

They are named as follows:

s_<lane>_<tile>.bcl

Format:

Bytes Description
0-3 Number of N clusters in unsigned 32bits little endian integer
4-(N+3)
Unsigned 8 bits integer
  • Bits 0-1 are bases encoded as: [A,C,G,T] -> [0,1,2,3] -> [00,01,10,11]
  • Bits 2-7 are shifted by 2 bits and contain the quality score.
  • All bits '0' is reserved for no call (N)

Stat file

The stats files can be found in the BaseCalls directory inside the run directory: Data/Intensities/BaseCalls/L00<lane>/C<cycle>.1

They are named as follows:

s_<lane>_<tile>.stats

The Stats file is a binary file containing base calling statistics; the content is described below.

The data is for clusters passing filter only:

Start Description Data type
Byte 0 Cycle number integer
Byte 4 Rverage Cycle Intensity double
Byte 12 Average intensity for A over all clusters with intensity for A double
Byte 20 Average intensity for C over all clusters with intensity for C double
Byte 28 Average intensity for G over all clusters with intensity for G double
Byte 44 Average intensity for A over clusters with base call A double
Byte 52 Average intensity for C over clusters with base call C double
Byte 60 Average intensity for G over clusters with base call G double
Byte 68 Average intensity for T over clusters with base call T double
Byte 76 Number of clusters with base call A integer
Byte 80 Number of clusters with base call C integer
Byte 84 Number of clusters with base call G integer
Byte 88 Number of clusters with base call T integer
Byte 92 Number of clusters with base call X integer
Byte 96 Number of clusters with intensity for A integer
Byte 100 Number of clusters with intensity for C integer
Byte 104 Number of clusters with intensity for G integer
Byte 108 Number of clusters with intensity for T integer

References

See also mkdata.sh file in bcl2fastq source code for insights on bcl format.