It is a great pleasure to note that a paper titled “DSSR, an integrated software tool for dissecting the spatial structure of RNA” has recently been published in Nucleic Acids Research (NAR). Co-authored by Harmen Bussemaker, Wilma Olson and me (a team with a unique combination of complementary expertise), this DSSR paper represents another solid piece of work that I feel proud of. In contrast to our previous GpU dinucleotide platform paper focusing on results, and the two major 3DNA papers concentrating on methods, the current NAR article describes significant scientific findings that are enabled by the novel analysis algorithms implemented in the program. Moreover, DSSR introduces an appealing and highly informative “cartoon-block” representation of RNA structures that combines PyMOL cartoon schematics with 3DNA base color-coded rectangular blocks.
The abstract of the paper is quoted below:
Insight into the three-dimensional architecture of RNA is essential for understanding its cellular functions. However, even the classic transfer RNA structure contains features that are overlooked by existing bioinformatics tools. Here we present DSSR (Dissecting the Spatial Structure of RNA), an integrated and automated tool for analyzing and annotating RNA tertiary structures. The software identifies canonical and noncanonical base pairs, including those with modified nucleotides, in any tautomeric or protonation state. DSSR detects higher-order coplanar base associations, termed multiplets. It finds arrays of stacked pairs, classifies them by base-pair identity and backbone connectivity, and distinguishes a stem of covalently connected canonical pairs from a helix of stacked pairs of arbitrary type/linkage. DSSR identifies coaxial stacking of multiple stems within a single helix and lists isolated canonical pairs that lie outside of a stem. The program characterizes ‘closed’ loops of various types (hairpin, bulge, internal, and junction loops) and pseudoknots of arbitrary complexity. Notably, DSSR employs isolated pairs and the ends of stems, whether pseudoknotted or not, to define junction loops. This new, inclusive definition provides a novel perspective on the spatial organization of RNA. Tests on all nucleic acid structures in the Protein Data Bank confirm the efficiency and robustness of the software, and applications to representative RNA molecules illustrate its unique features. DSSR and related materials are freely available at http://x3dna.org/.
During the review process, we were delighted that the referees confirmed the claim we made in the cover letter: “We would also like to emphasize that our reported results are easily verifiable, and we assure rigorous reproducibility of the data and figures described in this article.” Now that the paper has been published, as a follow-up, I’ve made available all the scripts and data files associated with the paper in a new section, “DSSR-NAR paper”, on the 3DNA Forum. The DSSR User Manual has also been updated with additional, previously undocumented, auxiliary options.
Overall, it took me more than ten days to create the 19 posts in the DSSR-NAR paper section and to revise the DSSR User Manual, along with other minor refinements for consistency. During the process, I’ve tried to make the scripts and data files self-contained for wide accessibility and easy understanding.
Any interested party should now be able to reproduce the table and figures (including the supplementary data) reported in the article. Moreover, with the additional details given in the post RNA cartoon-block representations with PyMOL and DSSR, one can easily generate similar schematic images as shown below:
I feel confident in claiming that the results reported in our DSSR paper are reproducible. If you have issues related to the paper, please post them on the 3DNA Forum. I strive to respond promptly to any questions asked there.
In summary, DSSR is an integrated computational tool, designed from the bottom up to streamline the analysis of RNA three-dimensional structures. It is built upon my extensive experience in supporting 3DNA, growing knowledge of RNA structures, and refined programming skills. DSSR has a combined set of functionalities well beyond the scope of any known specialized resources. The program may well serve as a cornerstone for RNA structural bioinformatics and will benefit a broad range of possible applications.
The conformation of the five-membered sugar ring in DNA/RNA structures can be characterized using the five corresponding endocyclic torsion angles (shown below).
On account of the five-membered ring constraint, the conformation can be characterized approximately by 5 - 3 = 2 parameters. Using the concept of pseudorotation of the sugar ring, the two parameters are the amplitude (τm) and phase angle (P, in the range of 0° to 360°).
One widely used set of formulas to convert the five torsion angles to the pseudorotation parameters is due to Altona & Sundaralingam (1972): “Conformational Analysis of the Sugar Ring in Nucleosides and Nucleotides. A New Description Using the Concept of Pseudorotation” [J. Am. Chem. Soc., 94(23), pp 8205–8212]. As always, the concept is best illustrated with an example. Here I use the sugar ring of G4 (chain A) of the Dickerson-Drew dodecamer (1bna), with Matlab/Octave code:
# xyz coordinates of the sugar ring: G4 (chain A), 1bna
ATOM 63 C4' DG A 4 21.393 16.960 18.505 1.00 53.00
ATOM 64 O4' DG A 4 20.353 17.952 18.496 1.00 38.79
ATOM 65 C3' DG A 4 21.264 16.229 17.176 1.00 56.72
ATOM 67 C2' DG A 4 20.793 17.368 16.288 1.00 40.81
ATOM 68 C1' DG A 4 19.716 17.901 17.218 1.00 30.52
# endocyclic torsion angles:
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4
Pconst = sin(pi/5) + sin(pi/2.5) # 1.5388
P0 = atan2(v4 + v1 - v3 - v0, 2.0 * v2 * Pconst); # 2.9034
tm = v2 / cos(P0); # amplitude: 48.469
P = 180/pi * P0; # phase angle: 166.35 [P + 360 if P0 < 0]
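The five torsion angles quoted in the script can be recomputed from the ring coordinates. The Python sketch below is not code from 3DNA or DSSR. It assumes the usual endocyclic definitions (v0 = C4'-O4'-C1'-C2' through v4 = C3'-C4'-O4'-C1') and a standard four-point dihedral formula, and the helper names are purely illustrative.

```python
import math

# Sugar-ring atoms of G4 (chain A, 1bna), copied from the ATOM records above.
RING = {
    "C4'": (21.393, 16.960, 18.505),
    "O4'": (20.353, 17.952, 18.496),
    "C3'": (21.264, 16.229, 17.176),
    "C2'": (20.793, 17.368, 16.288),
    "C1'": (19.716, 17.901, 17.218),
}

# Atom quadruples for v0..v4 (assumed: the usual endocyclic convention).
TORSION_ATOMS = [
    ("C4'", "O4'", "C1'", "C2'"),  # v0
    ("O4'", "C1'", "C2'", "C3'"),  # v1
    ("C1'", "C2'", "C3'", "C4'"),  # v2
    ("C2'", "C3'", "C4'", "O4'"),  # v3
    ("C3'", "C4'", "O4'", "C1'"),  # v4
]


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p1, p2, p3, p4):
    """Signed torsion angle p1-p2-p3-p4 in degrees, in (-180, 180]."""
    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def ring_torsions(ring):
    """Return [v0, v1, v2, v3, v4] for a dict of ring-atom coordinates."""
    return [dihedral(*(ring[name] for name in names)) for names in TORSION_ATOMS]


if __name__ == "__main__":
    for i, angle in enumerate(ring_torsions(RING)):
        print(f"v{i} = {angle:6.1f}")
```

Rounded to one decimal place, the values should agree with the v0 to v4 listed in the script above.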
The Altona & Sundaralingam (1972) pseudorotation parameters are what have been adopted in 3DNA, following the NewHelix program of Dr. Dickerson. The Curves+ program, on the other hand, uses another (newer) set of formulas due to Westhof & Sundaralingam (1983): “A Method for the Analysis of Puckering Disorder in Five-Membered Rings: The Relative Mobilities of Furanose and Proline Rings and Their Effects on Polynucleotide and Polypeptide Backbone Flexibility” [J. Am. Chem. Soc., 105(4), pp 970–976]. The two sets of formulas, by Altona & Sundaralingam (1972) and Westhof & Sundaralingam (1983), give slightly different numerical values for the two pseudorotation parameters (τm and P).
Since 3DNA and Curves+ are currently two of the most widely used programs for conformational analysis of nucleic acid structures, the subtle differences in pseudorotation parameters may cause confusion for users who use (or are familiar with) both programs. Over the past few years, I have indeed received such questions via email.
With the same G4 (chain A, 1bna) sugar ring, here is the Matlab/Octave script showing how Curves+ calculates the pseudorotation parameters:
# xyz coordinates of sugar ring G4 (chain A, 1bna)
# endocyclic torsion angles, same as above
v0 = -26.7; v1 = 46.3; v2 = -47.1; v3 = 33.4; v4 = -4.4
v = [v2, v3, v4, v0, v1]; # reorder them into vector v
A = 0; B = 0;
for i = 1:5
  t = 0.8 * pi * (i - 1);
  A += v(i) * cos(t);
  B += v(i) * sin(t);
end
A *= 0.4; # -48.476
B *= -0.4; # 11.516
tm = sqrt(A * A + B * B); # 49.825
c = A/tm; s = B/tm;
P = atan2(s, c) * 180 / pi; # 166.64
For this specific example, i.e., the sugar ring of G4 (chain A, 1bna), the pseudorotation parameters as calculated by 3DNA per Altona & Sundaralingam (1972) and Curves+ per Westhof & Sundaralingam (1983) are as follows:
           amplitude   phase angle
3DNA         48.469       166.35
Curves+      49.825       166.64
Needless to say, the differences are subtle, and few people will notice/bother at all. For those who do care about such little details, however, hopefully this post will help you understand where the differences actually come from.
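For readers who prefer Python, the two Octave scripts can be put side by side as follows. This is a direct transcription of the formulas shown above, not code taken from 3DNA or Curves+, and the function names are my own.

```python
import math


def pseudorotation_as1972(v0, v1, v2, v3, v4):
    """Altona & Sundaralingam (1972): return (amplitude, phase in degrees)."""
    pconst = math.sin(math.pi / 5) + math.sin(math.pi / 2.5)
    p0 = math.atan2(v4 + v1 - v3 - v0, 2.0 * v2 * pconst)
    # As in the script above, this divides by cos(P) and so is
    # ill-defined when P is exactly 90 or 270 degrees.
    tm = v2 / math.cos(p0)
    return tm, math.degrees(p0) % 360.0


def pseudorotation_ws1983(v0, v1, v2, v3, v4):
    """Westhof & Sundaralingam (1983): return (amplitude, phase in degrees)."""
    v = [v2, v3, v4, v0, v1]  # same reordering as in the script above
    a = 0.4 * sum(x * math.cos(0.8 * math.pi * i) for i, x in enumerate(v))
    b = -0.4 * sum(x * math.sin(0.8 * math.pi * i) for i, x in enumerate(v))
    tm = math.hypot(a, b)
    return tm, math.degrees(math.atan2(b, a)) % 360.0


if __name__ == "__main__":
    torsions = (-26.7, 46.3, -47.1, 33.4, -4.4)  # G4 (chain A), 1bna
    for label, func in (("3DNA", pseudorotation_as1972),
                        ("Curves+", pseudorotation_ws1983)):
        tm, p = func(*torsions)
        print(f"{label:8s} {tm:8.3f} {p:8.2f}")
```

The `% 360.0` step plays the role of the "P + 360 if P0 < 0" note in the first script, so both functions return a phase angle in the 0° to 360° range.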
For consistency with the 3DNA output, DSSR (by default) also follows the Altona & Sundaralingam (1972) definitions of sugar pseudorotation. Nevertheless, DSSR also contains an undocumented option, --sugar-pucker=westhof83, to output τm and P according to the Westhof & Sundaralingam (1983) definitions.
Each sugar is assigned to one of the following ten puckering modes, by dividing the phase angle (P, in the range of 0° to 360°) into ranges of 36° each.
C3'-endo, C4'-exo, O4'-endo, C1'-exo, C2'-endo,
C3'-exo, C4'-endo, O4'-exo, C1'-endo, C2'-exo
For sugars in nucleic acid structures, C3’-endo [0°, 36°) and C2’-endo [144°, 180°) are predominant. The former corresponds to sugars in ‘canonical’ RNA or A-form DNA, and the latter to sugars in standard B-form DNA. In reality, RNA structures as deposited in the PDB can also contain C2′-endo sugars. One significant example is the GpU dinucleotide platform, where the 5′-ribose sugar (G) is in the C2′-endo form and the 3′-sugar (U) is in the C3′-endo form (see my blog post titled ‘Is the O2′(G)…O2P H-bond in GpU platforms real?’).
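The assignment of a puckering mode from the phase angle amounts to a table lookup. Here is a minimal Python sketch, assuming half-open 36° bins that start at 0° and follow the order of the ten modes listed above; the function name is illustrative and is not part of DSSR.

```python
PUCKER_MODES = [
    "C3'-endo", "C4'-exo", "O4'-endo", "C1'-exo", "C2'-endo",
    "C3'-exo", "C4'-endo", "O4'-exo", "C1'-endo", "C2'-exo",
]


def pucker_mode(phase_deg):
    """Map a phase angle in degrees to one of the ten 36-degree bins."""
    p = phase_deg % 360.0
    # The final "% 10" guards against p rounding up to exactly 360.0.
    return PUCKER_MODES[int(p // 36.0) % 10]


if __name__ == "__main__":
    print(pucker_mode(166.35))  # G4 (chain A) of 1bna, 3DNA phase angle
```

With either the 3DNA (166.35°) or the Curves+ (166.64°) phase angle, G4 of 1bna falls in the [144°, 180°) bin, i.e. C2'-endo, as expected for B-form DNA.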
- This post is based on my 2011-06-11 blog post with the same title.
- While visiting Lyon in July 2014, I had the opportunity to hear Dr. Lavery’s opinion on adopting the Westhof & Sundaralingam (1983) sugar-pucker definitions in Curves+. I learned that the new formulas are more robust in rare, extreme cases of sugar conformation than the 1972 variants. After all, Dr. Sundaralingam is a co-author on both papers. It is possible that in future releases of DSSR, the 1983 formulas for sugar pucker will become the default.
In the DSSR v1.2.7-2015jun09 release, I documented two additional command-line options (--prefix and --cleanup) that are related to the various auxiliary files. As a matter of fact, these two options (among quite a few others) have been there for a long time, but without being explicitly described. The point is not to hide but to simplify: one of the design goals of DSSR is simplicity. DSSR already possesses numerous key functionalities to be appreciated. Before DSSR is firmly established in the RNA bioinformatics field, I believe too many nonessential “features” could be distracting. While writing and refining the DSSR code, I do feel that some ‘auxiliary’ features could be handy for experienced users (including myself). So along the way, I’ve added many ‘hidden’ options that are either experimental or potentially useful.
On the one hand, I sense it is acceptable for scientific software to actually do more than it claims. On the other hand, I have always been quick in addressing users’ requests. As one example, check the --select option recently introduced into DSSR in response to a user request, and the ‘hidden’ --dbn-break option for specifying the symbol to separate multiple chains or chain breaks in DSSR-derived dot-bracket notation.
As for --prefix and --cleanup, the purposes of these two closely related options can be best illustrated using the yeast phenylalanine tRNA structure (1ehz) as an example. By default, running x3dna-dssr -i=1ehz.pdb will produce a total of 11 auxiliary files, with names prefixed with dssr-, as shown below:
List of 11 additional files
1 dssr-stems.pdb -- an ensemble of stems
2 dssr-helices.pdb -- an ensemble of helices (coaxial stacking)
3 dssr-pairs.pdb -- an ensemble of base pairs
4 dssr-multiplets.pdb -- an ensemble of multiplets
5 dssr-hairpins.pdb -- an ensemble of hairpin loops
6 dssr-junctions.pdb -- an ensemble of junctions (multi-branch)
7 dssr-2ndstrs.bpseq -- secondary structure in bpseq format
8 dssr-2ndstrs.ct -- secondary structure in connect table format
9 dssr-2ndstrs.dbn -- secondary structure in dot-bracket notation
10 dssr-torsions.txt -- backbone torsion angles and suite names
11 dssr-stacks.pdb -- an ensemble of stacks
With ‘fixed’ generic names by default, users can run DSSR in a directory repeatedly without creating too many files. This practice follows that used in the 3DNA suite of programs. However, my experience in supporting 3DNA over the years has shown that users (myself included) may want to explore some of the files further, e.g. ‘dssr-multiplets.pdb’ for displaying the base multiplets (four triplets here). One could easily use the command line (or a script) to change a generic name to a more appropriate one, e.g., mv dssr-multiplets.pdb 1ehz-multiplets.pdb for 1ehz. A better solution, however, is to introduce a customized prefix for the additional files, and that’s exactly where the --prefix option comes in. The option is specified as --prefix=text, where text can be any string as appropriate. So running x3dna-dssr -i=1ehz.pdb --prefix=1ehz, for example, will lead to the following output:
List of 11 additional files
1 1ehz-stems.pdb -- an ensemble of stems
2 1ehz-helices.pdb -- an ensemble of helices (coaxial stacking)
3 1ehz-pairs.pdb -- an ensemble of base pairs
4 1ehz-multiplets.pdb -- an ensemble of multiplets
5 1ehz-hairpins.pdb -- an ensemble of hairpin loops
6 1ehz-junctions.pdb -- an ensemble of junctions (multi-branch)
7 1ehz-2ndstrs.bpseq -- secondary structure in bpseq format
8 1ehz-2ndstrs.ct -- secondary structure in connect table format
9 1ehz-2ndstrs.dbn -- secondary structure in dot-bracket notation
10 1ehz-torsions.txt -- backbone torsion angles and suite names
11 1ehz-stacks.pdb -- an ensemble of stacks
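When DSSR is called from a script, the auxiliary file names for a given prefix can be predicted in advance. The Python sketch below simply encodes the two listings shown for 1ehz. The helper names are hypothetical, and the set of files is an assumption taken from this one example, so other structures (or other DSSR versions) may well produce a different list.

```python
# Suffixes of the 11 auxiliary files listed above for 1ehz.
AUX_SUFFIXES = [
    "stems.pdb", "helices.pdb", "pairs.pdb", "multiplets.pdb",
    "hairpins.pdb", "junctions.pdb", "2ndstrs.bpseq", "2ndstrs.ct",
    "2ndstrs.dbn", "torsions.txt", "stacks.pdb",
]


def auxiliary_files(prefix="dssr"):
    """Expected auxiliary file names for a prefix (default: 'dssr')."""
    return [f"{prefix}-{suffix}" for suffix in AUX_SUFFIXES]


def dssr_command(pdb_file, prefix=None):
    """Argument list for running DSSR, using only the options shown above."""
    cmd = ["x3dna-dssr", f"-i={pdb_file}"]
    if prefix:
        cmd.append(f"--prefix={prefix}")
    return cmd


if __name__ == "__main__":
    print(" ".join(dssr_command("1ehz.pdb", prefix="1ehz")))
    for name in auxiliary_files("1ehz"):
        print(name)
```

The list returned by `dssr_command` is suitable for passing to `subprocess.run`, provided that x3dna-dssr is on the search path.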
The --cleanup option, as its name implies, is to tidy up a directory by removing the auxiliary files generated by DSSR. The usage is very simple:
x3dna-dssr --cleanup
x3dna-dssr --cleanup --prefix=1ehz
The former gets rid of the default ‘fixed’ generic auxiliary files (dssr-pairs.pdb etc.), whilst the latter deletes the prefixed supporting files (those with names starting with 1ehz-).
Recently, I came across, and was surprised by, the different assignments of HETATM vs. ATOM records for modified nucleotides in PDB vs. PDBx/mmCIF format. As always, the issue is best illustrated with a concrete example. Here is what I observed in the PDB entry 1ehz, the crystal structure of yeast phenylalanine tRNA at 1.93 Å resolution.
DSSR identifies 14 modified nucleotides (of 11 types) in 1ehz as shown below:
List of 11 types of 14 modified nucleotides
nt count list
1 1MA-a 1 A.1MA58
2 2MG-g 1 A.2MG10
3 5MC-c 2 A.5MC40,A.5MC49
4 5MU-t 1 A.5MU54
5 7MG-g 1 A.7MG46
6 H2U-u 2 A.H2U16,A.H2U17
7 M2G-g 1 A.M2G26
8 OMC-c 1 A.OMC32
9 OMG-g 1 A.OMG34
10 PSU-P 2 A.PSU39,A.PSU55
11 YYG-g 1 A.YYG37
In 1ehz.pdb downloaded from the RCSB PDB, all 14 modified nucleotides are assigned as HETATM, whereas in 1ehz.cif the corresponding records are ATOM. Here is the excerpt for 1MA58 in PDB format:
HETATM 1252 P 1MA A 58 73.770 67.765 34.057 1.00 30.65 P
HETATM 1253 OP1 1MA A 58 72.638 67.886 33.105 1.00 32.84 O
HETATM 1254 OP2 1MA A 58 73.621 68.229 35.450 1.00 29.49 O
HETATM 1255 O5' 1MA A 58 74.315 66.273 34.254 1.00 28.81 O
HETATM 1256 C5' 1MA A 58 74.592 65.439 33.080 1.00 29.42 C
HETATM 1257 C4' 1MA A 58 74.279 63.972 33.383 1.00 33.42 C
HETATM 1258 O4' 1MA A 58 74.880 63.685 34.667 1.00 32.36 O
HETATM 1259 C3' 1MA A 58 72.789 63.573 33.509 1.00 35.13 C
HETATM 1260 O3' 1MA A 58 72.625 62.168 33.250 1.00 36.80 O
HETATM 1261 C2' 1MA A 58 72.560 63.667 35.012 1.00 34.80 C
HETATM 1262 O2' 1MA A 58 71.525 62.828 35.506 1.00 36.27 O
HETATM 1263 C1' 1MA A 58 73.908 63.150 35.551 1.00 33.62 C
HETATM 1264 N9 1MA A 58 74.284 63.494 36.930 1.00 30.36 N
HETATM 1265 C8 1MA A 58 73.887 64.574 37.688 1.00 34.55 C
HETATM 1266 N7 1MA A 58 74.415 64.610 38.899 1.00 33.32 N
HETATM 1267 C5 1MA A 58 75.204 63.469 38.953 1.00 33.37 C
HETATM 1268 C6 1MA A 58 76.031 62.941 39.948 1.00 33.58 C
HETATM 1269 N6 1MA A 58 76.184 63.488 41.134 1.00 41.19 N
HETATM 1270 N1 1MA A 58 76.708 61.803 39.669 1.00 34.48 N
HETATM 1271 CM1 1MA A 58 77.649 61.222 40.626 1.00 31.43 C
HETATM 1272 C2 1MA A 58 76.527 61.216 38.479 1.00 28.43 C
HETATM 1273 N3 1MA A 58 75.793 61.624 37.453 1.00 31.67 N
HETATM 1274 C4 1MA A 58 75.142 62.771 37.747 1.00 33.02 C
The corresponding section in PDBx/mmCIF format is:
ATOM 1252 P P . 1MA A 1 58 ? 73.770 67.765 34.057 1.00 30.65 ? ? ? ? ? ? 58 1MA A P 1
ATOM 1253 O OP1 . 1MA A 1 58 ? 72.638 67.886 33.105 1.00 32.84 ? ? ? ? ? ? 58 1MA A OP1 1
ATOM 1254 O OP2 . 1MA A 1 58 ? 73.621 68.229 35.450 1.00 29.49 ? ? ? ? ? ? 58 1MA A OP2 1
ATOM 1255 O "O5'" . 1MA A 1 58 ? 74.315 66.273 34.254 1.00 28.81 ? ? ? ? ? ? 58 1MA A "O5'" 1
ATOM 1256 C "C5'" . 1MA A 1 58 ? 74.592 65.439 33.080 1.00 29.42 ? ? ? ? ? ? 58 1MA A "C5'" 1
ATOM 1257 C "C4'" . 1MA A 1 58 ? 74.279 63.972 33.383 1.00 33.42 ? ? ? ? ? ? 58 1MA A "C4'" 1
ATOM 1258 O "O4'" . 1MA A 1 58 ? 74.880 63.685 34.667 1.00 32.36 ? ? ? ? ? ? 58 1MA A "O4'" 1
ATOM 1259 C "C3'" . 1MA A 1 58 ? 72.789 63.573 33.509 1.00 35.13 ? ? ? ? ? ? 58 1MA A "C3'" 1
ATOM 1260 O "O3'" . 1MA A 1 58 ? 72.625 62.168 33.250 1.00 36.80 ? ? ? ? ? ? 58 1MA A "O3'" 1
ATOM 1261 C "C2'" . 1MA A 1 58 ? 72.560 63.667 35.012 1.00 34.80 ? ? ? ? ? ? 58 1MA A "C2'" 1
ATOM 1262 O "O2'" . 1MA A 1 58 ? 71.525 62.828 35.506 1.00 36.27 ? ? ? ? ? ? 58 1MA A "O2'" 1
ATOM 1263 C "C1'" . 1MA A 1 58 ? 73.908 63.150 35.551 1.00 33.62 ? ? ? ? ? ? 58 1MA A "C1'" 1
ATOM 1264 N N9 . 1MA A 1 58 ? 74.284 63.494 36.930 1.00 30.36 ? ? ? ? ? ? 58 1MA A N9 1
ATOM 1265 C C8 . 1MA A 1 58 ? 73.887 64.574 37.688 1.00 34.55 ? ? ? ? ? ? 58 1MA A C8 1
ATOM 1266 N N7 . 1MA A 1 58 ? 74.415 64.610 38.899 1.00 33.32 ? ? ? ? ? ? 58 1MA A N7 1
ATOM 1267 C C5 . 1MA A 1 58 ? 75.204 63.469 38.953 1.00 33.37 ? ? ? ? ? ? 58 1MA A C5 1
ATOM 1268 C C6 . 1MA A 1 58 ? 76.031 62.941 39.948 1.00 33.58 ? ? ? ? ? ? 58 1MA A C6 1
ATOM 1269 N N6 . 1MA A 1 58 ? 76.184 63.488 41.134 1.00 41.19 ? ? ? ? ? ? 58 1MA A N6 1
ATOM 1270 N N1 . 1MA A 1 58 ? 76.708 61.803 39.669 1.00 34.48 ? ? ? ? ? ? 58 1MA A N1 1
ATOM 1271 C CM1 . 1MA A 1 58 ? 77.649 61.222 40.626 1.00 31.43 ? ? ? ? ? ? 58 1MA A CM1 1
ATOM 1272 C C2 . 1MA A 1 58 ? 76.527 61.216 38.479 1.00 28.43 ? ? ? ? ? ? 58 1MA A C2 1
ATOM 1273 N N3 . 1MA A 1 58 ? 75.793 61.624 37.453 1.00 31.67 ? ? ? ? ? ? 58 1MA A N3 1
ATOM 1274 C C4 . 1MA A 1 58 ? 75.142 62.771 37.747 1.00 33.02 ? ? ? ? ? ? 58 1MA A C4 1
While I have not tested exhaustively, it seems that PDBx/mmCIF has adopted a different definition of what constitutes a HETATM residue. It is worth noting that results from 3DNA and DSSR/SNAP are not affected by the conflicting assignments.
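One way for a parser to stay insensitive to the ATOM/HETATM label is to treat both as coordinate records and to identify residues by name, chain and number only. The Python sketch below illustrates the idea for PDB-format lines. It is not how 3DNA or DSSR is implemented, the function names are made up, and the column spacing of the sample line has been reconstructed to the fixed-column PDB layout.

```python
def parse_coordinate_record(line):
    """Return (record, atom, resname, chain, resseq), or None for other lines."""
    record = line[:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return (
        record,
        line[12:16].strip(),  # atom name, columns 13-16
        line[17:20].strip(),  # residue name, columns 18-20
        line[21],             # chain identifier, column 22
        int(line[22:26]),     # residue sequence number, columns 23-26
    )


def residue_key(line):
    """Residue identity (resname, chain, resseq), ignoring the record type."""
    parsed = parse_coordinate_record(line)
    return None if parsed is None else parsed[2:]


# First atom of 1MA58 as in 1ehz.pdb (spacing reconstructed, see above).
HETATM_LINE = (
    "HETATM 1252  P   1MA A  58      73.770  67.765  34.057  1.00 30.65           P"
)
# The same atom relabeled as ATOM, mimicking the assignment seen in 1ehz.cif.
ATOM_LINE = "ATOM  " + HETATM_LINE[6:]

if __name__ == "__main__":
    print(residue_key(HETATM_LINE) == residue_key(ATOM_LINE))
```

Since the residue key is the same for both lines, any downstream analysis built on it gives the same result regardless of which label the file uses.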
Nowadays, “big data” and “big science” are hot topics. They all sound good and certainly come about for a reason. Yet, to transform data to information to knowledge to understanding to wisdom, sophisticated software tools are required. The programs can be big and complicated, or small and self-contained, fitting different purposes. As long as they can get the claimed job done in a robust fashion, size should not be a concern.
Over the years, however, I have seen a trend of bloated software with many (fragile) dependencies in bioinformatics. Some tools are so picky and hard to use/maintain that instead of serving, they become sort of a master. As a representative example, I recently tried to install an open-source software package associated with a paper published just a few years ago in a leading journal. The software has only a few dependencies, yet some of them have already become obsolete. I spent hours each time, on Mac OS X and two versions of Ubuntu Linux, but failed to get it running properly (it always aborted with error messages). The download page hosting the software has been inactive since around the publication of the paper. Presumably, the PhD student or postdoc who wrote the code had left the lab, and with a paper published, all is done!
As an active practitioner of bioinformatics for well over a decade, I can confidently claim to be well above average in familiarity with Linux/Mac OS X, the associated shell programming and tools such as make, and various common scripting and compiled programming languages. Yet, once in a while, I get frustrated when I try to download and install a software tool attached to a paper I am interested in. As I see it, the vast majority of software programs from research labs are publication-oriented: as long as a paper is published, the software is finished.
From my experience, I have always seen software as engineering. It needs careful design and meticulous attention to detail. A sophisticated piece of scientific software is a combination of science and engineering. Expertise in domain knowledge is a must, and refined skills in computer programming are indispensable. The DSSR program, which I created and have continuously refined over the past three years, represents what I believe scientific software should be.
Among other unique features, DSSR is tiny (< 1 MB), self-contained (without run-time dependencies) and runs on Windows, Mac OS X, and Linux. Getting DSSR up and running should take only minutes for anyone with basic familiarity with common computer systems. I have no doubt that the beauty of being small, as represented by DSSR, will be gradually appreciated by the community.
Over the past few weeks, I’ve had the pleasure to talk to Thomas Holder, the PyMOL Principal Developer at Schrödinger, on possible integration of DSSR into PyMOL. On Tuesday April 21, 2015, I wrote to Thomas:
Last year, I had the pleasure to collaborate with Dr. Robert Hanson to integrate DSSR into Jmol, see http://chemapps.stolaf.edu/jmol/jsmol/dssr.htm. I am wondering if you have any interest in connecting DSSR to PyMOL. This will not only benefit both parties, but also bring elaborate analyses of RNA structures to the general audience. As you may be aware, RNA is becoming increasingly important, yet the field of RNA structural bioinformatics is lagging (far) behind that of proteins.
After a few meet-ups, we all agree that the DSSR-PyMOL integration project would be meaningful/significant for RNA structural bioinformatics. Moreover, the community not only can benefit from the end result, but also should be able to make direct contributions through the process. On Friday May 08, 2015, Thomas sent out the following open invitation, titled Someone interested in writing a DSSR plugin for PyMOL?, to the PyMOL mailing list:
Is anyone interested in writing a DSSR plugin for PyMOL? DSSR is an integrated software tool for Dissecting the Spatial Structure of RNA (http://x3dna.bio.columbia.edu/docs/dssr-manual.pdf). Among other things, DSSR defines the secondary structure of RNA from 3D atomic coordinates in a way similar to DSSP does for proteins. Most of its output could be translated 1:1 into PyMOL selections, making it available for coloring and other selection based features. A PyMOL plugin could act as a wrapper which runs DSSR for an object or atom selection. Xiang-Jun Lu, the author of DSSR, is also working on base pair visualization (see http://x3dna.org/articles/seeing-is-understanding-as-well-as-believing), similar to (but more advanced) what’s already available from 3DNA (http://pymolwiki.org/index.php/3DNA).
Xiang-Jun would be happy to collaborate with someone who has experience with Python and the PyMOL API for writing an extension or plugin. Please contact me if this sounds appealing to you.
Get DSSR from http://x3dna.org/
See it hooked up with JSmol: http://chemapps.stolaf.edu/jmol/jsmol/dssr.htm
If you are self-motivated, care about software quality, have expertise in writing PyMOL plugins, and feel the pain in RNA structural analysis/visualization with currently available tools, now is the time to make a difference. The DSSR/PyMOL project would ideally be composed of a team of dedicated practitioners with complementary skills. We will communicate mostly via email or online forum, in a presumably open and highly interactive way. By working on the project, you will be able to sharpen your skills and make new friends. The end product would not only make RNA structural bioinformatics easier for yourself but also benefit the community at large.
From the very beginning, 3DNA has contained two key programs, analyze and rebuild, for the analysis and rebuilding of nucleic acid 3D structures. The two names are short and to the point, but with one caveat: they are common verbs that can easily be picked up by other software packages. When 3DNA and such packages are installed on the same machine, naming clashes happen. If the 3DNA bin/ directory is searched afterwards, the analyze or rebuild command may have nothing to do with nucleic acid structures at all. Naturally, this naming ambiguity can lead to confusion and frustration.
I’ve been aware of the rebuild program name conflict for a long time. Recently, I was surprised by another analyze program on my Mac OS X Yosemite. As shown in the following output, the analyze program seems to be installed via Mac port, and it is about analyzing words in a dictionary file.
~  which analyze
~  analyze -h
correct syntax is:
analyze affix_file dictionary_file file_of_words_to_check
use two words per line for morphological generation
The ambiguous names are exactly the reason that I use x3dna-dssr and x3dna-snap for the two new programs I’ve been working on over the past few years. As for the analyze and rebuild programs in 3DNA v2.x, I’d rather leave them as is. 3DNA is now too widely used in other structural bioinformatics pipelines to allow for name changes without causing compatibility issues. On the positive side, once you know the problem, fixing it is straightforward. This post is to raise awareness of such naming conflicts in the 3DNA user community.
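To check whether such a clash exists on a given machine, one can list every executable with a given name along the search path, in the order the shell would find them. Here is a small Python sketch for POSIX-like systems, using only the standard library; the function name is illustrative.

```python
import os


def find_all_on_path(name, path=None):
    """List every executable called `name` on PATH, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for program in ("analyze", "rebuild"):
        for location in find_all_on_path(program):
            print(program, "->", location)
```

The first hit is the one that runs when the bare name is typed. If it is not the one in the 3DNA bin/ directory, calling the 3DNA program by its full path or reordering the search path avoids the ambiguity.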
Canonical bases (A, C, G, T and U) in nucleic acid structures have standard atom names, shown below using the Watson-Crick A–T and G–C pairs. Ring atoms of adenine, for example, are named (N1, C2, N3, C4, C5, C6, N7, C8, N9) respectively.
Four characters are reserved for atom names in the PDB format. The convention, as seen in files downloaded from the RCSB PDB, is to put the two-character base atom name in the middle, as in .N1. (note that each dot is used here for a space character to make it stand out).
A long time ago, I became aware of a PDB format variant where the name is left-aligned, as in N1.. (two trailing spaces). This case has been properly handled by 3DNA (including DSSR and SNAP) ever since. While checking submitted entries to web-DSSR, I recently noticed yet another PDB format variation, which labels base atoms in the format of ..N1 (i.e., right-aligned). Without taking this special variant of PDB format into consideration, 3DNA/DSSR reported that “no nucleotides found!” Once the issue is known, however, fixing it is straightforward. As of May 4, 2015, 3DNA v2.2, DSSR and SNAP can all handle this special PDB variant correctly.
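One simple way to accommodate all three alignments is to strip the four-character field before comparing atom names, and to re-pad it only when writing. The Python sketch below shows the idea. It is not the actual 3DNA/DSSR code, and the re-padding rule assumes a one-letter element symbol (C, N, O, P), which covers the standard nucleic acid atoms but not, for example, metal ions.

```python
def normalize_atom_name(field):
    """Alignment-independent atom name, e.g. ' N1 ', 'N1  ', '  N1' -> 'N1'."""
    return field.strip()


def to_pdb_field(name):
    """Re-pad a name to four characters in the RCSB style (' N1 ').

    Assumes a one-letter element symbol; names that already fill four
    characters are returned unchanged.
    """
    name = name.strip()
    if len(name) >= 4:
        return name
    return (" " + name).ljust(4)


if __name__ == "__main__":
    for variant in (" N1 ", "N1  ", "  N1"):
        canonical = to_pdb_field(normalize_atom_name(variant))
        print(repr(variant), "->", repr(canonical))
```

All three variants map to the same name, N1, so the lookup of base ring atoms no longer depends on how a particular program chose to pad the field.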
Over the years, I have come across many PDB variants claimed to be compliant with the loosely defined format. If you find that 3DNA or DSSR is not working as expected, it is likely that the coordinate file in the self-claimed ‘PDB format’ is at fault. Wherever practical, I’ve tried to accommodate as many non-standard variants as possible.
The NDB (Nucleic Acid Database) is a valuable resource dedicated to “information about experimentally-determined nucleic acids and complex assemblies.” Over the years, however, I’ve gradually switched from NDB to PDB (Protein Data Bank) for my research on nucleic acid structures. NDB is derived from PDB and presumably should contain all nucleic acid structures available in the PDB. However, at the time of this writing (on April 9, 2015), the NDB says: “As of 8-Apr-2015 number of released structures: 7430” and the PDB states “7611 Nucleic Acid Containing Structures”. So PDB has 7611 - 7430 = 181 more entries of nucleic acid structures than the NDB, possibly due to a lag in NDB’s processing of newly released PDB structures. Another issue is the inconsistency of the NDB identifier: early entries have e.g. bdl084 for B-DNA (355d in PDB), but now NDB seems to use the same id as the PDB (e.g., 4p5j).
The RCSB PDB maintains a weekly-updated summary file named pdb_entry_type.txt in plain text format (the RCSB PDB provides a list of such useful summary files), containing “List of all PDB entries, identification of each as a protein, nucleic acid, or protein-nucleic acid complex and whether the structure was determined by diffraction or NMR.” An excerpt of the file is shown below:
108m prot diffraction
109d nuc diffraction
109l prot diffraction
109m prot diffraction
10gs prot diffraction
10mh prot-nuc diffraction
110d nuc diffraction
110l prot diffraction
102m prot diffraction
103d nuc NMR
Specifically, a nucleic acid structure contains the (sub)string nuc in the second field, where prot-nuc means a protein-RNA/DNA complex. This text file is trivial to parse, and the atomic coordinate files (in PDB or PDBx/mmCIF format) for all nucleic acid structures can be automatically downloaded from the RCSB PDB using a script.
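As an illustration of how little code the parsing takes, here is a minimal Python sketch that picks out the nucleic-acid-containing entries. It assumes whitespace-separated fields as in the excerpt above; the function name is mine, and the download step is deliberately left out.

```python
# The excerpt of pdb_entry_type.txt shown above.
EXCERPT = """\
108m prot diffraction
109d nuc diffraction
109l prot diffraction
109m prot diffraction
10gs prot diffraction
10mh prot-nuc diffraction
110d nuc diffraction
110l prot diffraction
102m prot diffraction
103d nuc NMR
"""


def nucleic_acid_entries(lines):
    """Return the ids of entries whose second field contains 'nuc'."""
    ids = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and "nuc" in fields[1]:
            ids.append(fields[0])
    return ids


if __name__ == "__main__":
    print(nucleic_acid_entries(EXCERPT.splitlines()))
```

For the real file, the same function can be applied to an open file handle, e.g. `nucleic_acid_entries(open("pdb_entry_type.txt"))`.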
It is worth noting that DSSR is checked against all nucleic acid structures in the PDB at the time of each release to ensure that it does not crash. I update my local copy of nucleic acid structures each week, and run DSSR on the new entries. This process not only provides me an opportunity to keep pace with new developments in the field but also allows me to keep refining DSSR as needs arise.