From early on, DSSR-derived nucleic acid secondary structures have been written in the compact dot-bracket notation (.dbn) with pseudo-knot information. To better connect DSSR to the 2D world, I recently looked into the connect (.ct) format, which was first introduced by Zuker’s mfold program. Over time, the .ct format has become one of the most commonly used RNA secondary structure formats, and it is more expressive than the .dbn format (see below).
As of v1.0, for each analyzed structure, DSSR produces two secondary structure files with default names dssr-2ndstrs.dbn and dssr-2ndstrs.ct, in .dbn and .ct formats, respectively. Using the 27-nucleotide (nt) RNA fragment 1msy as an example, the DSSR-derived secondary structures in .dbn and .ct formats are shown below:
In dot-bracket notation (.dbn) [dssr-2ndstrs.dbn]
------------------------------------------------------
>1msy nts=27 DSSR-derived secondary structure
UGCUCCUAGUACGUAAGGACCGGAGUG
.(((((.....(....)....))))).
------------------------------------------------------

In connect format (.ct) [dssr-2ndstrs.ct]
------------------------------------------------------
   27 DSSR-derived secondary structure in '1msy'
    1 U     0     2     0  2647 # name=A.U2647
    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672
    3 C     2     4    25  2649 # name=A.C2649, pairedNt=A.G2671
    4 U     3     5    24  2650 # name=A.U2650, pairedNt=A.A2670
    5 C     4     6    23  2651 # name=A.C2651, pairedNt=A.G2669
    6 C     5     7    22  2652 # name=A.C2652, pairedNt=A.G2668
    7 U     6     8     0  2653 # name=A.U2653
    8 A     7     9     0  2654 # name=A.A2654
    9 G     8    10     0  2655 # name=A.G2655
   10 U     9    11     0  2656 # name=A.U2656
   11 A    10    12     0  2657 # name=A.A2657
   12 C    11    13    17  2658 # name=A.C2658, pairedNt=A.G2663
   13 G    12    14     0  2659 # name=A.G2659
   14 U    13    15     0  2660 # name=A.U2660
   15 A    14    16     0  2661 # name=A.A2661
   16 A    15    17     0  2662 # name=A.A2662
   17 G    16    18    12  2663 # name=A.G2663, pairedNt=A.C2658
   18 G    17    19     0  2664 # name=A.G2664
   19 A    18    20     0  2665 # name=A.A2665
   20 C    19    21     0  2666 # name=A.C2666
   21 C    20    22     0  2667 # name=A.C2667
   22 G    21    23     6  2668 # name=A.G2668, pairedNt=A.C2652
   23 G    22    24     5  2669 # name=A.G2669, pairedNt=A.C2651
   24 A    23    25     4  2670 # name=A.A2670, pairedNt=A.U2650
   25 G    24    26     3  2671 # name=A.G2671, pairedNt=A.C2649
   26 U    25    27     2  2672 # name=A.U2672, pairedNt=A.G2648
   27 G    26     0     0  2673 # name=A.G2673
------------------------------------------------------
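The two listings above encode the same base pairs, which can be checked by turning the dot-bracket string into a table of pairing partners and comparing it with the 5th column of the .ct file. The following is a minimal Python sketch of that conversion, not DSSR's own code; the function name `dbn_to_pairs` is hypothetical, and the handling of `[]`, `{}`, and `<>` as extra bracket types for pseudoknots is an assumption about the notation rather than something shown in this example.

```python
def dbn_to_pairs(dbn):
    """Return a list whose (i-1)th entry is the 1-based partner of nt i, or 0."""
    # Assumption: each bracket type is matched independently of the others.
    closers = {")": "(", "]": "[", "}": "{", ">": "<"}
    stacks = {opener: [] for opener in closers.values()}
    partner = [0] * len(dbn)

    for i, ch in enumerate(dbn, start=1):
        if ch in stacks:                      # opening bracket: remember position
            stacks[ch].append(i)
        elif ch in closers:                   # closing bracket: pair with last opener
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched {ch!r} at position {i}")
            j = stack.pop()
            partner[i - 1] = j
            partner[j - 1] = i
        # any other character (such as '.') is treated as unpaired

    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched {opener!r} at position {stack[-1]}")
    return partner


pairs = dbn_to_pairs(".(((((.....(....)....))))).")
print(pairs[1], pairs[11])   # partners of nt 2 and nt 12
```

For the 1msy string, nt 2 pairs with nt 26 and nt 12 with nt 17, in agreement with the 5th column of the .ct listing.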
At first sight, the .ct format is very simple, and examining a sample file such as the one above gives a pretty good sense of what each column is about. While many oversimplified descriptions of the .ct format exist on the web, the most detailed and accurate explanation comes from the mfold manual:
The "ct" file (connect table) contains the sequence and base pair information, and is meant to be an input file for a structure drawing program. In addition to containing base pair information, it also lists the 5′ and 3′ neighbor of each base, allowing for the representation of circular RNA or multiple molecules. The ct file also lists the historical base numbering in the original sequence, as bases and base pairs are numbered according from 1 to the size of the folded segment. A portion of a ct file is displayed in Figure 12.
Figure 12: The ct file for the second and final folding of S. cerevisiae Phe-tRNA at 37°, with default parameters. The first record displays the fragment size (76), ΔG and sequence name. The ith subsequent record contains, in order, i, r_i, the index of the 5′-connecting base, the index of the 3′-connecting base, the index of the paired base and the historical numbering of the ith base in the original sequence. The 5′, 3′ and base pair indices are 0 when there is no connection or base pair.
In particular, the 3rd, 4th, and 6th columns in the .ct format each carry their own information; by design, they are not redundant with the 1st column. Note that in the above ‘1msy’ example, the 6th column gives the nt sequence numbers (as in the PDB data file) instead of the serial numbers (as in the 1st column). The DSSR-produced .ct files also contain extra information after ‘#’, in comma-separated key=value format.
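The column layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name `parse_ct` and the dictionary keys are hypothetical, the six fields are assumed to be whitespace-separated, and everything after ‘#’ is assumed to follow the comma-separated key=value form seen in DSSR output. The sample text is a truncated excerpt (header plus the first three records) of the 1msy listing.

```python
SAMPLE = """\
   27 DSSR-derived secondary structure in '1msy'
    1 U     0     2     0  2647 # name=A.U2647
    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672
    3 C     2     4    25  2649 # name=A.C2649, pairedNt=A.G2671
"""


def parse_ct(text):
    """Parse .ct text into (declared size, title, list of record dicts)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split(None, 1)
    size = int(header[0])                      # number of nts declared in header
    title = header[1] if len(header) > 1 else ""

    records = []
    for ln in lines[1:]:
        data, _, comment = ln.partition("#")   # split off the DSSR extras, if any
        idx, base, prev5, next3, pair, number = data.split()[:6]
        extras = {}
        for item in comment.split(","):
            key, sep, value = item.strip().partition("=")
            if sep:                            # keep only well-formed key=value items
                extras[key] = value
        records.append({
            "index": int(idx),       # 1st column: serial number
            "base": base,            # 2nd column: base letter
            "prev5": int(prev5),     # 3rd column: 5' neighbor, 0 if none
            "next3": int(next3),     # 4th column: 3' neighbor, 0 if none
            "pair": int(pair),       # 5th column: paired nt, 0 if unpaired
            "number": int(number),   # 6th column: original sequence number
            "extras": extras,        # key=value items after '#'
        })
    return size, title, records


size, title, records = parse_ct(SAMPLE)
print(records[1]["pair"], records[1]["extras"].get("pairedNt"))
```

Keeping the serial index and the original sequence number as separate fields mirrors the distinction between the 1st and 6th columns noted above.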
As an example of the usefulness of the 3rd and 4th columns, have a look at the DSSR-derived .ct file for the Dickerson DNA dodecamer duplex with sequence CGCGAATTCGCG:
   24 DSSR-derived secondary structure in '355d'
    1 C     0     2    24     1 # name=A.DC1, pairedNt=B.DG24
    2 G     1     3    23     2 # name=A.DG2, pairedNt=B.DC23
    3 C     2     4    22     3 # name=A.DC3, pairedNt=B.DG22
    4 G     3     5    21     4 # name=A.DG4, pairedNt=B.DC21
    5 A     4     6    20     5 # name=A.DA5, pairedNt=B.DT20
    6 A     5     7    19     6 # name=A.DA6, pairedNt=B.DT19
    7 T     6     8    18     7 # name=A.DT7, pairedNt=B.DA18
    8 T     7     9    17     8 # name=A.DT8, pairedNt=B.DA17
    9 C     8    10    16     9 # name=A.DC9, pairedNt=B.DG16
   10 G     9    11    15    10 # name=A.DG10, pairedNt=B.DC15
   11 C    10    12    14    11 # name=A.DC11, pairedNt=B.DG14
   12 G    11     0    13    12 # name=A.DG12, pairedNt=B.DC13
   13 C     0    14    12    13 # name=B.DC13, pairedNt=A.DG12
   14 G    13    15    11    14 # name=B.DG14, pairedNt=A.DC11
   15 C    14    16    10    15 # name=B.DC15, pairedNt=A.DG10
   16 G    15    17     9    16 # name=B.DG16, pairedNt=A.DC9
   17 A    16    18     8    17 # name=B.DA17, pairedNt=A.DT8
   18 A    17    19     7    18 # name=B.DA18, pairedNt=A.DT7
   19 T    18    20     6    19 # name=B.DT19, pairedNt=A.DA6
   20 T    19    21     5    20 # name=B.DT20, pairedNt=A.DA5
   21 C    20    22     4    21 # name=B.DC21, pairedNt=A.DG4
   22 G    21    23     3    22 # name=B.DG22, pairedNt=A.DC3
   23 C    22    24     2    23 # name=B.DC23, pairedNt=A.DG2
   24 G    23     0     1    24 # name=B.DG24, pairedNt=A.DC1
Note the 0 in the 4th column for A.DG12, which is at the 3′ end of chain A, and the 0 in the 3rd column for B.DC13, which is at the 5′ end of chain B.
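These zeros are what allow a reader of the file to recover the individual strands without any chain labels. As a rough Python sketch (the helper name `split_strands` and the tuple layout are hypothetical, and the records are assumed to be in file order), a strand is closed whenever the 3′ index is 0 and a new one is opened whenever the 5′ index is 0:

```python
def split_strands(records):
    """Group serial indices into strands using the 5'/3' connectivity columns.

    `records` is a list of (index, prev5, next3) tuples in file order,
    i.e. the 1st, 3rd, and 4th columns of each .ct record.
    """
    strands, current = [], []
    for index, prev5, next3 in records:
        if prev5 == 0 and current:     # no 5' neighbor: a new strand begins here
            strands.append(current)
            current = []
        current.append(index)
        if next3 == 0:                 # no 3' neighbor: the current strand ends here
            strands.append(current)
            current = []
    if current:                        # e.g. a circular molecule with no 0 entries
        strands.append(current)
    return strands


# Columns 1, 3, and 4 of the 355d listing, regenerated rather than retyped:
# the 5' index is 0 for nts 1 and 13, the 3' index is 0 for nts 12 and 24.
DUPLEX = [
    (i, 0 if i in (1, 13) else i - 1, 0 if i in (12, 24) else i + 1)
    for i in range(1, 25)
]

for strand in split_strands(DUPLEX):
    print(strand[0], "to", strand[-1], "with", len(strand), "nts")
```

Applied to the duplex, this yields two strands of 12 nts each, covering serial numbers 1 to 12 and 13 to 24. The same information cannot be read off a plain dot-bracket string without an extra convention for marking strand breaks.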