FileFormatsAppendix
Nucleotide sequences can be provided to GTFold in standardized FASTA format.
In FASTA files, each nucleotide sequence begins with a single-line description that must start with the greater-than symbol (>). Subsequent lines should contain only the sequence itself. The sequence may be formatted with whitespace, which is ignored; however, blank lines are not allowed in the middle of FASTA input. FASTA files should have a ".fasta" extension.
(FASTA FORMAT SAMPLE)
>Title of Sequence
AAA GCGG UUTGTT UTCUTaaTCTXXXXUCAGG
UUA GCCG UUTGTT UTCUTaaTCTGGG
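The FASTA rules above can be sketched in a few lines of Python. This is only an illustration of the described format, not GTFold's own reader; the function name `parse_fasta` is hypothetical, and rejecting blank lines with an exception is an assumption about how strictly the rule should be enforced.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text.

    Hypothetical helper for illustration; not part of GTFold.
    """
    records = []
    title, chunks = None, []
    # Leading and trailing whitespace is trimmed, so only blank lines
    # in the middle of the input are treated as errors.
    for line in text.strip().splitlines():
        if not line.strip():
            raise ValueError("blank line in the middle of FASTA input")
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        else:
            if title is None:
                raise ValueError("sequence data before the '>' description line")
            # Whitespace inside a sequence line is formatting only, so drop it.
            chunks.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


sample = """>Title of Sequence
AAA GCGG UUTGTT UTCUTaaTCTXXXXUCAGG
UUA GCCG UUTGTT UTCUTaaTCTGGG
"""
records = parse_fasta(sample)
title, sequence = records[0]
print(title)
print(sequence)
```

The sketch keeps the sequence characters exactly as written (including case) and leaves any alphabet checking to the caller.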
A CT (Connectivity Table) file contains secondary structure information for a sequence. These files are saved with a CT extension. When entering a structure to calculate the free energy, the following format must be followed.
- Start of first line: number of bases in the sequence
- End of first line: title of the structure
- Each of the following lines provides information about a given base in the sequence. Each base has its own line, with these elements in order:
  - Base number: index n
  - Base (A, C, G, T, U, X)
  - Index n-1
  - Index n+1
  - Number of the base to which n is paired. No pairing is indicated by 0 (zero).
  - Natural numbering. RNAstructure ignores the actual value given in natural numbering, so it is easiest to repeat n here.
The CT file may hold multiple structures for a single sequence. This is done by repeating the format for each structure without any blank lines between structures.
CT files generated by RNAstructure are compatible with mfold/UNAFold (available from Michael Zuker) and many other software packages.
(CT FORMAT SAMPLE -- EXAMPLE OF OUTPUT PRODUCED BY GTFOLD)
300 ENERGY = 7.0 example
1 G 0 2 22 1
2 G 1 3 21 2
3 G 2 4 20 3
4 C 3 5 19 4
5 G 4 6 0 5
6 A 5 7 0 6
7 A 6 8 0 7
8 U 7 9 18 8
9 U 8 10 17 9
10 G 9 11 16 10
(structure continues to next structure...)
300 ENERGY = 6.2 example
1 G 0 2 0 1
2 G 1 3 0 2
3 G 2 4 20 3
4 C 3 5 19 4
5 G 4 6 0 5
6 A 5 7 0 6
7 A 6 8 0 7
8 U 7 9 18 8
9 U 8 10 17 9
10 G 9 11 16 10
(structure continues to next structure or end of file...)
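As a rough sketch of how the CT layout described above could be read, the following Python function walks through one or more structures stored back to back. The name `parse_ct` and the returned dictionary layout are hypothetical, and the eight-base input is made-up toy data (including its energies), since the sample above is truncated. Skipping blank lines is a leniency of this sketch, not part of the format.

```python
def parse_ct(text):
    """Return a list of structures read from CT-formatted text.

    Hypothetical helper for illustration; not part of GTFold.
    Each structure is a dict with 'title', 'sequence', and 'pairs',
    where pairs[i] is the partner of base i + 1, or 0 if unpaired.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    structures = []
    pos = 0
    while pos < len(lines):
        # First line: number of bases, then the title.
        header = lines[pos].split(None, 1)
        n = int(header[0])
        title = header[1].strip() if len(header) > 1 else ""
        rows = lines[pos + 1 : pos + 1 + n]
        if len(rows) != n:
            raise ValueError(f"expected {n} base lines, found {len(rows)}")
        bases, pairs = [], []
        for expected, row in enumerate(rows, start=1):
            # Columns: n, base, n-1, n+1, pairing partner, natural numbering.
            index, base, _prev, _next, partner, _natural = row.split()[:6]
            if int(index) != expected:
                raise ValueError(f"expected base {expected}, found {index}")
            bases.append(base)
            pairs.append(int(partner))
        structures.append(
            {"title": title, "sequence": "".join(bases), "pairs": pairs}
        )
        pos += n + 1
    return structures


# Toy input: two made-up structures for the same eight-base sequence.
toy_ct = """8 ENERGY = -1.2 toy
1 G 0 2 8 1
2 G 1 3 7 2
3 C 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 C 6 8 2 7
8 C 7 0 1 8
8 ENERGY = -0.4 toy
1 G 0 2 0 1
2 G 1 3 7 2
3 C 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 C 6 8 2 7
8 C 7 0 0 8
"""
structures = parse_ct(toy_ct)
for s in structures:
    print(s["title"], s["sequence"], s["pairs"])
```

Using the base count on the header line to decide where each structure ends is what lets several structures share one file without separators.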
Folding constraints are saved in plain text with a CON extension and can be edited by hand. When a constraint type has multiple entries, each entry is listed on a separate line. When there are no constraints of a given type, no entry lines are required. Note that RNAstructure expects every specifier to be present, each followed by its "-1" or "-1 -1" terminator. For specifiers that take two arguments, the first argument is assumed to be the lower base pair number. The file format is as follows:
(CFF FORMAT EXAMPLE)
DS:
XA
-1
SS:
XB
-1
Mod:
XC
-1
Pairs:
XD1 XD2
-1 -1
FMN:
XE
-1
Forbids:
XF1 XF2
-1 -1
- XA: Nucleotides that will be double-stranded
- XB: Nucleotides that will be single-stranded (unpaired)
- XC: Nucleotides accessible to chemical modification
- XD1, XD2: Forced base pairs
- XE: Nucleotides accessible to FMN cleavage
- XF1, XF2: Prohibited base pairs
(ANOTHER CFF FORMAT SAMPLE)
DS:
15
25
76
-1
SS:
17
18
20
35
-1
Mod:
2
15
-1
Pairs:
16 26
-1 -1
FMN:
-1
Forbids:
15 27
-1 -1
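To make the section structure concrete, here is a minimal Python sketch that reads a constraint file of the shape shown above. The function name `parse_constraints` and the dictionary it returns are hypothetical; the sketch assumes each specifier appears on its own line ending in a colon and that a line consisting only of -1 values closes the current section.

```python
SINGLE = ("DS", "SS", "Mod", "FMN")   # one nucleotide index per line
PAIRED = ("Pairs", "Forbids")         # two nucleotide indices per line


def parse_constraints(text):
    """Return a dict mapping each specifier to its list of entries.

    Hypothetical helper for illustration; not part of GTFold.
    """
    result = {name: [] for name in SINGLE + PAIRED}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            current = line[:-1]
            if current not in result:
                raise ValueError(f"unknown specifier: {current}")
            continue
        if current is None:
            raise ValueError(f"entry outside any section: {line}")
        values = [int(tok) for tok in line.split()]
        if all(v == -1 for v in values):
            current = None  # "-1" or "-1 -1" closes the section
            continue
        expected = 2 if current in PAIRED else 1
        if len(values) != expected:
            raise ValueError(f"{current} entries take {expected} value(s)")
        result[current].append(tuple(values) if expected == 2 else values[0])
    return result


sample = """DS:
15
25
76
-1
SS:
17
18
20
35
-1
Mod:
2
15
-1
Pairs:
16 26
-1 -1
FMN:
-1
Forbids:
15 27
-1 -1
"""
constraints = parse_constraints(sample)
print(constraints)
```

Pair entries are stored in the order given; the sketch does not check that the first index is the lower one.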
The file format for SHAPE reactivity comprises two columns. The first column is the nucleotide number, and the second is the reactivity.
Nucleotides for which there is no SHAPE data can either be left out of the file or be given a reactivity of less than -500. Columns are separated by any whitespace.
Note that the sample below has no header information. Nucleotides 1 through 10 have no reactivity information: nucleotides 1 through 8 are left out, and nucleotides 9 and 10 are entered as -999. Nucleotide 11 has a normalized SHAPE reactivity of 0.042816. Nucleotide 12 has a normalized SHAPE reactivity of 0, which is NOT the same as having no reactivity when using the pseudo-energy constraints.
(SHAPE FORMAT SAMPLE)
9 -999
10 -999
11 0.042816
12 0
13 0.15027
14 0.16201
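The distinction between "no data" and a reactivity of 0 is easy to lose when reading the file, so a small Python sketch may help. The function name `parse_shape` is hypothetical, and representing missing data as `None` is a choice made for this illustration only.

```python
def parse_shape(text):
    """Return a dict mapping nucleotide number to reactivity.

    Hypothetical helper for illustration; not part of GTFold.
    Values below -500 mean "no data" and are stored as None.
    """
    reactivities = {}
    for raw in text.splitlines():
        fields = raw.split()  # columns may be separated by any whitespace
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {raw!r}")
        position, value = int(fields[0]), float(fields[1])
        reactivities[position] = None if value < -500 else value
    return reactivities


sample = """9 -999
10 -999
11 0.042816
12 0
13 0.15027
14 0.16201
"""
shape = parse_shape(sample)
for position in (8, 10, 12):
    # .get() returns None both for omitted nucleotides and for flagged ones.
    print(position, shape.get(position))
```

With this representation, nucleotide 12 keeps its measured value of 0.0, while nucleotides 1 through 10 all read as `None` whether they were omitted or flagged.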