It gives me great pleasure to announce that the 3DNA/DSSR project is now funded by the NIH R24GM153869 grant, titled "X3DNA-DSSR: a resource for structural bioinformatics of nucleic acids". I am deeply grateful for the opportunity to continue working on a project that has basically defined who I am. It was a tough time during the funding gap over the past few years. Nevertheless, I have experienced and learned a lot, and witnessed miracles enabled by enthusiastic users.
Since late 2020 when I lost my R01 grant, DSSR has been licensed by the Columbia Technology Ventures (CTV). I appreciate the numerous users (including big pharma) who purchased a DSSR Pro License or a DSSR Basic paid License. Thanks to the NIH R24GM153869 grant, we are pleased to provide DSSR Basic free of charge to the academic community. Academic Users may submit a license request for DSSR Basic or DSSR Pro by clicking "Express Licensing" on the CTV landing page. Commercial users may inquire about pricing and licensing terms by emailing techtransfer@columbia.edu, copying xiangjun@x3dna.org.
The current version of DSSR is v2.4.5-2024sep24 which contains miscellaneous bug fixes (e.g., chain id with > 4 chars) and minor improvements. This release synchronizes with the new R24 funding, which will bring the project to the next level. All existing users are encouraged to upgrade their installation.
Lots of exciting things will happen for the project. The first thing is to make DSSR freely accessible to the academic community. In the past couple of weeks, CTV have already issued quite a few DSSR Basic Academic licenses to users from all over the world. So the demand is high, and it will become stronger as more academic users become aware of DSSR. I'm closely monitoring the 3DNA Forum, and is always ready to answer users questions.
I am committed to making DSSR a brand that stands for quality and value. By virtue of its unmatched functionality, usability, and support, DSSR saves users a substantial amount of time and effort when compared to other options. My track record throughout the years has unambiguously demonstrated my dedication to this solid software product.
DSSR Basic contains all features described in the three DSSR-related papers, and includes the originally separate SNAP program (still unpublished) for analyzing DNA/RNA-protein complexes. The Pro version integrates the classic 3DNA functionality, plus advanced modeling routines, with email/Zoom/phone support.
The least-squares (LS) fitting procedures presented below make use of well known mathematics. Indeed, the methods are so well known and widely used that it is somewhat difficult to locate the original references. In our previous effort to resolve the discrepancies among nucleic acid conformational analysis programs, we came across a variety of LS fitting procedures. Here we provide a detailed description, with step-by-step examples, of our implementation in 3DNA of two LS fitting algorithms based on a covariance matrix and its eigen-system. This post is the revised version of a note first made available in the “Technical Details” section of earlier 3DNA websites.
LS fitting between standard and experimental bases
Three analysis schemes — CompDNA, Curves/Curves+, and RNA — use LS procedures to fit a standard base with an embedded reference frame to an observed base structure. CompDNA
and Curves/Curves+
take advantage of the conventional approach of McLachlan [“Least Squares Fitting of Two Structures.” J. Mol. Biol., 128, 74-79 (1979)], while the RNA
program implements a closed-form solution of absolute orientation using unit quaternions first introduced by Horn. The two algorithms are mathematically equivalent for the most general cases, since the unit quaternion can be transformed to the rotation matrix given by McLachlan. The Horn method, however, is more straightforward and generally applicable; it can be applied even when one or both of the structures are perfectly planar, whereas the McLachlan approach fails.
Here we use the ideal adenine geometry derived from the high resolution crystal structures of model nucleosides, nucleotides, and bases. The x-, y-, and z-coordinates of the standard base, taken from the NDB, are listed below in the columns labeled sx
, sy
, and sz
, respectively. s_(average)
is the geometric center of the base.
sx sy sz
1 N9 0.213 0.660 1.287
2 C4 0.250 2.016 1.509
3 N3 0.016 2.995 0.619
4 C2 0.142 4.189 1.194
5 N1 0.451 4.493 2.459
6 C6 0.681 3.485 3.329
7 N6 0.990 3.787 4.592
8 C5 0.579 2.170 2.844
9 N7 0.747 0.934 3.454
10 C8 0.520 0.074 2.491
------------------------------------
s_(average): 0.4589 2.4803 2.3778
We similarly describe the coordinates of one of the adenine bases (the fifth nucleotide in the sequence strand) from the high resolution (1.4 Å) self-complementary d(CGCGAATTCGCG) dodecamer duplex determined by Williams and co-workers (PDB id: 355d). The experimental xyz coordinates are listed below in the columns labeled ex
, ey
, and ez
. The geometric center is e_(average)
. Note that the atomic serial numbers from the PDB (first column) have been rearranged so that the atoms are in the same order as those of the ideal base listed above.
ex ey ez
91 N9 16.461 17.015 14.676
100 C4 15.775 18.188 14.459
99 N3 14.489 18.449 14.756
98 C2 14.171 19.699 14.406
97 N1 14.933 20.644 13.839
95 C6 16.223 20.352 13.555
96 N6 16.984 21.297 12.994
94 C5 16.683 19.056 13.875
93 N7 17.918 18.439 13.718
92 C8 17.734 17.239 14.207
------------------------------------
e_(average):16.1371 19.0378 14.0485
We collect the two sets of xyz coordinates in the 10 × 3 matrices S
and E
corresponding respectively to the standard and experimental bases. We then construct the 3 × 3 covariance matrix C
between S
and E
using the following formula:
1 1
C = ------- [S' E - --- S' i i' E]
n - 1 n
=
0.2782 0.2139 -0.1601
-1.4028 1.9619 -0.2744
1.0443 0.9712 -0.6610
Here n
, the number of atoms in each base, is 10, and i
is an n x 1 column vector consisting of only ones. S'
and i'
are the transpose of matrix S
and column vector i
, respectively.
From the nine elements of the C
matrix, we subsequently generate the 4 × 4 real symmetric matrix M
using the expression:
| c11+c22+c33 c23-c32 c31-c13 c12-c21 |
M = | c23-c32 c11-c22-c33 c12+c21 c31+c13 |
| c31-c13 c12+c21 -c11+c22-c33 c23+c32 |
| c12-c21 c31+c13 c23+c32 -c11-c22+c33 |
=
1.5792 -1.2456 1.2044 1.6167
-1.2456 -1.0228 -1.1890 0.8842
1.2044 -1.1890 2.3447 0.6968
1.6167 0.8842 0.6968 -2.9011
The largest eigenvalue of matrix M
is 4.0335, and its corresponding unit eigenvector is:
[ q0 q1 q2 q3 ] = [ 0.6135 -0.2878 0.7135 0.1780 ]
The rotation matrix R
is deduced from the above eigenvector as below:
| q0q0+q1q1-q2q2-q3q3 2(q1q2-q0q3) 2(q1q3+q0q2) |
R = | 2(q2q1+q0q3) q0q0-q1q1+q2q2-q3q3 2(q2q3-q0q1) |
| 2(q3q1-q0q2) 2(q3q2+q0q1) q0q0-q1q1-q2q2+q3q3 |
=
-0.0817 -0.6291 0.7730
-0.1923 0.7710 0.6072
-0.9779 -0.0990 -0.1839
Following coordinate transformation with matrix R
, the origin of the standard base is found to be displaced from the experimental structure by:
o = e_(average) - s_(average) R' = [15.8969 15.7701 15.1802]
The least-squares fitted coordinates (F
) of the standard base atoms on the experimental structure are then given by:
F = S R' + i o
=
16.4592 17.0194 14.6699
15.7747 18.1925 14.4586
14.4899 18.4519 14.7542
14.1729 19.6974 14.4070
14.9343 20.6404 13.8420
16.2222 20.3472 13.5569
16.9832 21.2875 12.9925
16.6829 19.0585 13.8760
17.9183 18.4437 13.7219
17.7335 17.2396 14.2062
Here S
is the (n x 3) matrix of original coordinates of the standard base, and as noted above, i
is an n x 1 column vector consisting of only ones.
The difference matrix (D
) between F
and E
, the (n x 3) matrix of original coordinates of the experimental base, and the root-mean-square (RMS) deviation between the two structures are found as:
D = E - F
=
0.0018 -0.0044 0.0061
0.0003 -0.0045 0.0004
-0.0009 -0.0029 0.0018
-0.0019 0.0016 -0.0010
-0.0013 0.0036 -0.0030
0.0008 0.0048 -0.0019
0.0008 0.0095 0.0015
0.0001 -0.0025 -0.0010
-0.0003 -0.0047 -0.0039
0.0005 -0.0006 0.0008
RMS deviation = 0.0054
It should be noted that if the standard base is already defined in terms of its reference frame, as in 3DNA (e.g., $X3DNA/config/Atomic_A.pdb
), the vector o
and the matrix R
represent the best-fitted coordinate frame of the experimental base. Moreover, the three axes of the frame given by R
are guaranteed to be orthonormal. If you want to get an insight of the LS fitting algorithm and a better understanding of how 3DNA derives its base reference frame, it’d be a valuable experience to repeat the above procedure with $X3DNA/config/Atomic_A.pdb
.
Note: the algorithm does not apply to a molecule vs its inversion (an improper rotation) — thanks to Boris Averkiev for reporting this subtle point (see comments below). One possible remedy is to treat this edge case separately.
Base normal
Rather than fit a standard base to experimental coordinates, the CEHS, FREEHELIX, and NUPARM analyses perform a fitting of a LS plane to a set of atoms in order to define the base and base-pair normals. The covariance matrix based on the n x 3 matrix of experimental Cartesian coordinates E
is diagonalized to find the vector normal to the best plane. Specifically, C
is obtained using the above formula with S
substituted by E
. The normal vector then lies along the eigenvector that corresponds to the smallest eigenvalue. Note that the coefficient 1/(n-1)
in the formula for calculating C
has no effect on the direction of the eigenvectors but scales the magnitudes of the eigenvalues.
Using the above adenine base from the high resolution dodecamer duplex as an example, the covariance matrix C
is:
C =
1.6680 -0.5015 -0.3253
-0.5015 2.0670 -0.5840
-0.3253 -0.5840 0.3061
The smallest eigenvalue of C
, 8.26e-5
, indicates that the base is almost perfectly planar. The corresponding unit eigenvector corresponding to the base normal is:
Base normal: 0.2737 0.3224 0.9062
Related topics:
As the old saying goes, a picture is worth a thousand words. To help you have a better idea of what 3DNA/DSSR is about, we’ve collected the following pictures; they serve to demonstrate selected features from 3DNA/DSSR’s versatile functionality.
Schematic diagram of base-pair parameters
Influence of Slide and Roll on DNA helical conformation
Roll-introduced DNA bending
Global bending of DNA associated with selective B → A conformational transformation
Canonical fiber models of A-, B-, C- and Z-DNA
3DNA-generated view of a four-way DNA–RNA junction (1egk)
3DNA-detected pentaplets in the large ribosomal subunit (1jj2)
Nucleic-acid-containing structures generated with w3DNA
Analysis of DNA with a B-Z junction (2acj, left) and detection of hydration patterns (right)
Schematics images auto-generated via blocview
Over the years, the fiber
utility program has become a handy way to generate standard B-DNA and A-DNA structures, as evident from citations to 3DNA. Nevertheless, the currently collected 55 experimental fiber models, comprehensive as they are, do not include one for canonical double-stranded (ds) RNA or single-stranded (ss) RNA structures of generic A/C/G/U sequence.
This situation is best illustrated by a recent article by Charles Brooks and Hashim Al-Hashimi and their co-workers, titled Unraveling the structural complexity in a single-stranded RNA tail: implications for efficient ligand binding in the prequeuosine riboswitch [Nucleic Acids Research, 40(3) 1345–1355 (2012)] , where they wrote:
Idealized A-form structures were constructed using Insight II (Molecular Simulations, Inc.) correcting the propeller twist angles from +15° to –15° using an in-house program, as previously described (47). The complementary strand was removed and the resulting ssRNA used in NMR data analysis. B-form helices were constructed using W3DNA (48).
As of 3DNA v2.1, however, that’s no longer the case: now the fiber
utility provides direct support for generating idealized dsRNA or ssRNA structures of arbitrary A/C/G/U sequence. As always, the new functionality can be best illustrated with examples. Let’s build ssRNAs of the wild-type (5’-AUAAAAAACUAA-3’) and A29C mutated form (5’-AUAACAAACUAA-3’) used in the work cited above:
fiber -r -s -seq=AUAAAAAACUAA wt-12nt.pdb
fiber -r -s -seq=AUAACAAACUAA mt-12nt.pdb
Here the -r
option is for RNA, -s
for a ss structure, and -seq
for the specific base sequence. The generated ssRNA structure for the wild-type sequence is named wt-12nt.pdb, and that for the mutated sequence named mt-12nt.pdb.
Note that the new RNA model is based on Struther Arnott’s work of fiber A-DNA from calf thymus (#1 in the list). The dsRNA, as its dsDNA counterpart, has a helical twist of 32.7° and a helical rise of 2.548 Å. Relevant to the above citation, here the propeller twist angle of each base pair is –10.5°, a negative value similar to that observed in high-resolution x-ray crystal structures. Furthermore, you can easily verify the three numbers with the following commands:
fiber -r -seq=AUAAAAAACUAA wt-12nt.pdb
find_pair wt-12nt.pdb stdout | analyze stdin
In summary, it is very easy to generate canonical RNA structures with the revised fiber
command. Through its integrated analysis routine, 3DNA can also be used to check structural features of the resultant RNA models. Moreover, as mentioned in the opening post What can 3DNA do for RNA structures? on the forum, 3DNA has much to offer in the filed of RNA structural bioinformatics.
At the C2B2 party this afternoon, I was asked the question: “Does 3DNA work for RNA?” Well, a good question, indeed. The short answer is definitely, YES. However, a detailed explanation is needed to address the underlying intuitive assumption: 3DNA is only for DNA.
- The name 3DNA was due to Dr. Olson, after we struggled quite a while. Initially, we played with NuStar (which was actually cited once by Richard Dickerson et al.), and Carnival etc. I still remember the day when Dr. Olson asked me “How about 3DNA?” We immediately reached an agreement: that’s it — what a cute name! Another advantage (as it becomes clear later): since 3DNA starts with ‘3’, it (mostly) shows up right at the top of many on-line lists of bioinformatics tools.
- Interpreted literally, 3DNA could mean 3-DNA, i.e., the three most common types of DNA: A-, B- and Z-form. That may be one of the reasons where the misconception that 3DNA is only for 3DNA comes from. Another reason could be that structural work on DNA is what the Olson lab best known for.
- The number ‘3’ in 3DNA should also be associated with its three key components: analysis, rebuilding and visualization. In a sense, this is my favorite.
- Of course, 3DNA stands for 3D-NA, 3-Dimensional Nucleic Acids, as expressed explicitly in the titles of our two 3DNA papers (2003 NAR and 2008 NP).
The applications of 3DNA to RNA structures can be broadly categorized as follows:
- Automatically detect all existing base-pairs, Watson-Crick (A-U, G-C, wobble G-U) or non-canonical, using a set of simple geometric criteria. Furthermore, it has a unique base-pair classification system based on the six numerical structural parameters, suitable for database storage and search.
- Automatically detect all triplets or higher-order base-associations.
- Automatically detect double helical regions, regardless of backbone connection, thus ideal for finding pseudo-continuous coaxial stacking.
- The above three features are seamlessly integrated with the visualization component to allow for easy generation of publication quality images. See the 3DNA 2008 NP paper for detailed examples.
As further examples, the following two RNA publications take advantage of find_pair
from 3DNA:
It is well worth noting that the base-pair detecting algorithm in RNAView is based on an earlier version of find_pair
, a basic fact ignored in the RNAView publication.
In summary, 3DNA works for RNA as well as for DNA, and more.
A video overview of DSSR
DSSR (Dissecting the Spatial Structure of RNA) is an integrated software tool for the analysis/annotation, model building, and schematic visualization of 3D nucleic acid structures (see the figures below and the video overview). It is built upon the well-known, tested, and trusted 3DNA suite of programs. DSSR has been made possible by the developer’s extensive user-support experience, detail-oriented software engineering skills, and expert domain knowledge accumulated over two decades. It streamlines tasks in RNA/DNA structural bioinformatics, and outperforms its ‘competitors’ by far in terms of functionality, usability, and support.
Wide citations. DSSR has been widely cited in scientific literature, including: (i) “Selective small-molecule inhibition of an RNA structural element” (Nature, 2015; Merck Research Laboratories), (ii) “The structure of the yeast mitochondrial ribosome” (Science, 2017), (iii) “RNA force field with accuracy comparable to state-of-the-art protein force fields” (PNAS, 2018; D. E. Shaw Research), (iv) “Predicting site-binding modes of ions and water to nucleic acids using molecular solvation theory” (JACS, 2019), (v) “RIC-seq for global in situ profiling of RNA-RNA spatial interactions” (Nature, 2020), and (vi) “DNA mismatches reveal conformational penalties in protein-DNA recognition” (Nature, 2020).
Broad integrations. To make DSSR as widely accessible as possible, I have initiated collaborations with the principal developers of Jmol and PyMOL. The DSSR-Jmol and DSSR-PyMOL integrations bring unparalleled search capabilities (e.g., ‘select junctions’ for all multi-branch loops) and innovative visualization styles into 3D nucleic acid structures. DSSR has also been adopted into numerous other structural bioinformatics resources, including: (i) URS, (ii) RiboSketch, (iii) RNApdbee, (iv) forgi, (v) RNAvista, (vi) VeriNA3d, (vii) RNAMake, (viii) ElTetrado, (ix) DNAproDB, (x) LocalSTAR3D, (xi) IPANEMAP, and (xii) RNANet.
Advanced features. DSSR may be licensed from Columbia University. DSSR Pro is the commercial version. It has more functionalities than DSSR basic (the free academic version), including: (i) homology modeling via in silico base mutations, a feature employed by Merck scientists, (ii) easy generation of regular helical models, including circular or super-helical DNA (see figures below), (iii) creation of customized structures with user-specified base sequences and rigid-body parameters, (iv) efficient processing of molecular dynamics (MD) trajectories, (v) detailed characterization of DNA-protein or RNA-protein spatial interactions, and (vi) template-based modeling of DNA-protein complexes (see figures below). DSSR Pro supersedes 3DNA. It integrates the disparate analysis and modeling programs of 3DNA under one umbrella, and offers new advanced features, through a convenient interface. For example, with the mutate module of DSSR Pro, one can automatically perform the following tasks: (i) mutate all bases to Us, (ii) mutate bases in hairpin loops to Gs, and (iii) mutate G–C Watson-Crick pairs to C–G, and A–U to U–A. Moreover, DSSR Pro includes an in-depth user manual and one-year technical support from the developer.
Quality control. DSSR is a solid software product that excels in RNA structural bioinformatics. It is written in strict ANSI C, as a single command-line program. It is self-contained, with zero runtime dependencies on third-party libraries. The binary executables for macOS, Linux, and Windows are just ~2MB. DSSR has been extensively tested using all nucleic-acid-containing structures in the PDB. It is also routinely checked with Valgrind to avoid memory leaks. DSSR requires no set up or configuration: it simply works.
Theoretical models of G-quadruplexes, created using DSSR Pro.
Template-based modeling of DNA-protein complexes using DSSR Pro.
Here are two chromatin-like models using PDB entry 4xzq as the template.
Circular DNA duplexes modeled using DSSR Pro.
DNA super helices modeled using DSSR Pro.
Innovative cartoon-block schematics enabled by the DSSR-PyMOL integration for six representative PDB entries. Watson-Crick pairs are shown as long blocks with minor-groove edges in black (A, B), G-tetrads represented as square blocks and the metal ion as sphere ©, the ligand rendered as balls-and-sticks (D), and proteins depicted as purple cartoons (E, F). Color code for base blocks: A, red; C, yellow; G, green; T, blue; U, cyan; G-tetrad, green; WC-pairs, per base in the leading strand. Visit http://skmatic.x3dna.org.
Recommended in Faculty Opinions: “simple and effective”, “Good for Teaching”.
Employed by the NDB to create cover images of the RNA Journal.
The following links point to tools that are relevant to 3DNA.
- Curves+ — an updated version of the well-known Curves program, and it conforms to the standard base reference frame.
- 3D-DART — 3DNA-Driven DNA Analysis and Rebuilding Tool. Another web-interface to commonly used 3DNA functionality.
- do_x3dna — “do_x3dna has been developed for analysis of the DNA/RNA dynamics during the molecular dynamics simulations. It uses the 3DNA package to calculate several structural descriptors of DNA/RNA from the GROMACS MD trajectory. It executes 3DNA tools to calculate these descriptors and subsequently, extracts these output and saves into external output files as a function of time.”
- SwS — a Solvation web Service for Nucleic Acids where 3DNA plays a role.
- Raster3D — a set of tools for generating high-quality raster images of proteins or other molecules.
- MolScript — a program for displaying molecular 3D structures, such as proteins, in both schematic and detailed representations.
- Jmol — an open-source Java viewer for chemical structures in 3D with features for chemicals, crystals, materials, and biomolecules.
- PyMOL — a user-sponsored molecular visualization system on an open-source foundation.
- ImageMagick — a software suite to create, edit, compose, or convert bitmap images.
- NDB — Nucleic acids database.
- SBGrid — Excellent services for structural biology laboratories as well software developers.
The v2.1 release of 3DNA, currently in beta, contains many refinements of existing C programs, a complete migration from Perl scripts to Ruby, and additions of several significant new programs. All know bugs in v2.0 have been fixed. Highlights include:
- Added mutate_bases to perform in silico base mutations in nucleic-acid-containing structures (DNA, RNA, and their complexes with ligands and proteins). The program has two key and unique features: (1) the sugar-phosphate backbone conformation is untouched; (2) the base reference frame (position and orientation) is reserved, i.e., the mutated structure shares the same base-pair/step parameters as those of the native structure.
- Added
x3dna_ensemble
, a Ruby script to automate the processing of an NMR structure ensemble or MD trajectories in MODEL/ENDMDL delineated PDB format. It has sub-commands analyze
, extract
, reorient
, and block_iamge
. To add: convert
to transform Amber, Gromacs or CHARMM trajectories.
- Enhanced
find_pair
with -c+
option for generating input to Curves+.
- Expanded
fiber
with the -s
option for generating single stranded structures; the -seq
option for specifying base sequence directly on the command line; and the -r
option for generating RNA structures (single or double stranded) of arbitrary ACGU sequences.
- Updated the ‘baselist.dat’ file to incorporate all types of NDB/PDB nucleotides as of February 15, 2015; refined
find_pair/analyze/mutate_bases
etc to automatically detect and assign of modified bases.
- Renamed Atomic_a.pdb and Atomic.a.pdb etc for modified bases to account for Mac OS X filesystem case sensitivity issue; Copied all Perl scripts to a new directory
perl_scripts/
.
- 3DNA now generates PDB files that are compliant with PDB format v3.x, and also has option to allow for three-letter nucleotide names, thus directly compatible with PdbViewer and HADDock. An option is provided to convert 3DNA-generated base rectangular blocks in Alchemy to the more widely accepted MDL molfile format (e.g. by PyMOL).
Recently, I (together with Drs. Wilma Olson and Harmen Bussemaker – a team with a unique combination of complementary expertise) published a new article in Nucleic Acids Research (NAR): The RNA backbone plays a crucial role in mediating the intrinsic stability of the GpU dinucleotide platform and the GpUpA/GpA mini duplex. The key findings of this work are summarized in the abstract:
The side-by-side interactions of nucleobases contribute to the organization of RNA, forming the planar building blocks of helices and mediating chain folding. Dinucleotide platforms, formed by side-by-side pairing of adjacent bases, frequently anchor helices against loops. Surprisingly, GpU steps account for over half of the dinucleotide platforms observed in RNA-containing structures. Why GpU should stand out from other dinucleotides in this respect is not clear from the single well-characterized H-bond found between the guanine N2 and the uracil O4 groups. Here, we describe how an RNA-specific H-bond between O2’(G) and O2P(U) adds to the stability of the GpU platform. Moreover, we show how this pair of oxygen atoms forms an out-of-plane backbone ‘edge’ that is specifically recognized by a non-adjacent guanine in over 90% of the cases, leading to the formation of an asymmetric miniduplex consisting of ‘complementary’ GpUpA and GpA subunits. Together, these five nucleotides constitute the conserved core of the well-known loop-E motif. The backbone-mediated intrinsic stabilities of the GpU dinucleotide platform and the GpUpA/GpA miniduplex plausibly underlie observed evolutionary constraints on base identity. We propose that they may also provide a reason for the extreme conservation of GpU observed at most 5’-splice sites.
As a nice surprise, this publication was selected by NAR as a featured article! According to the NAR website:
Featured Articles highlight the best papers published in NAR. These articles are chosen by the Executive Editors on the recommendation of Editorial Board Members and Referees. They represent the top 5% of papers in terms of originality, significance and scientific excellence.
I feel very gratified with the “extra” recognition. From my own perspective, I can easily rank this paper as the top one in my publication list: from the very beginning, I has been struck by the simplicity and elegance of the GpU story. Hopefully, time will verify the validity of this scientific contribution.
Behind the hood, though, there is a long, complex (sometimes perplexing), yet interesting story associated with this work. Here is how it got started. While writing the 3DNA 2008 Nature Protocols (NP) paper, I selected the (previously undocumented) ‘-p’ option of find_pair
to showcase its capability to identify higher-order base associations, using the large ribosomal subunit (1jj2) as an example. I noticed the unexpected O2’(G)⋅⋅⋅O2P H-bond within the GpU dinucleotide platform in a pentaplet (Figure A below). I was/am well aware of Leontis-Westholf’s pioneering work on Geometric nomenclature and classification of RNA base pairs which involves three distinct edges – the Watson-Crick edge, the Hoogsteen edge, and the Sugar edge, yet without taking into consideration of possible sugar-phosphate backbone interactions (Figure B below). So I decided to double-check, just to be sure that the H-bond was not spurious due to defects in the H-bond detecting scheme of find_pair
, and the finding was very surprising.
The following section was re-added into the 3DNA NP paper in the very last revision:
It is also worth noting that the G1971–U1972 platform is stabilized not only by the well-characterized G(N2)⋅⋅⋅U(O4) H-bond interaction, but also by a little-noticed G(O2’)⋅⋅⋅U(O2P) sugar-phosphate backbone interaction (Fig. 6a). Examination of the 50S large ribosomal unit (1JJ2) alone reveals ten such double H-bonded G–U platforms, far more occurrences than those registered by any other dinucleotide platform (including A–A) in this structure. Apparently, the G–U platform is more stable than other platforms with only a single base–base H-bond interaction. We are currently investigating this overrepresented G–U dinucleotide platform in other RNA structures. (p.1226)
See also Is the O2’(G)…O2P H-bond in GpU platforms real?
Structural analysis of nucleic acids used to be a rather tedious process, especially for irregular, complicated RNA structures and nucleic-acid/protein complexes [e.g., the large ribosomal subunit of H. marismortui (1jj2)]. Without valid base-pairing information arranged properly in a duplex fragment as input, analysis programs such as Curves+
and analyze/cehs
in 3DNA would produce meaningless results. The program find_pair
in 3DNA was originally created to solve this specific problem, i.e., to generate an input file to 3DNA analysis routines directly from a nucleic-acid containing structure in PDB format. It is what makes nucleic acids structural analysis a routine process — running through thousands of structures from NDB/PDB can be fully automated.
Overall, find_pair
has more than fulfilled the goal of its initial design (as stated above). Over the past few years, its functionality has been expanded and continuously refined (kaizen 改善), making find_pair
itself a full-featured application. Now, it is efficient, robust, and its simple command line interface allows for easy integration with other bioinformatics tools. Properly acknowledged or otherwise, find_pair
has served (at least) as one of the key components in many other applications (RNAView
, BPS
, SwS
, ARTS
, to name just a few). Indeed, find_pair
is by far the single program in 3DNA that has received the most questions (as evident from the 3DNA forum).
While I still have to write a method paper to describe the underlying algorithms of find_pair in detail — i.e., for identifying nucleotides, H-bonds, base pairs, high-order base associations, and double helical regions — the basic idea is intuitive and very easy to understand: as summarized in our recent GpU paper”, find_pair
is purely geometric based (with user adjustable parameters) and allows for the identification of canonical Watson–Crick as well as non-canonical base pairs, made up of normal or modified bases, regardless of tautomeric or protonation state. For example, in the GpU paper”, we chose the following set of stringent parameters to ensure that the geometry of each identified base pair is nearly planar and supports at least one inter-base H-bond: (i) a vertical distance (stagger) between base planes ≤ 1.5 Å; (ii) an angle between base normal vectors ≤ 30°; and (iii) a pair of nitrogen and/or oxygen base atoms at a distance ≤ 3.3 Å. Other criteria (documented or otherwise), such as the distance between the origins of the two standard base reference frames, are just filters to speed up the calculations.
In a nutshell, find_pair has the following two core functionalities:
- The default is to generate input to the analysis routines in 3DNA (
analyze/cehs
) for double helices. However, there are many more job to perform under the hood than just identifying base pairs: the base pairs must be in proper sequential order, and each strand must be in 5’ to 3’ direction, for the calculated step parameters (twist, roll etc) to make sense. Moreover, with the “-c” option, one gets an input file to Curves
(but not Curves+, yet); with the “-s” or “-1” option, find_pair
treats the whole structure as one single strand, and is useful for getting all backbone torsion angles.
- Detect all base pairs (regardless of double helical regions) and higher-oder (3+) base associations with the “-p” option. This feature (in its preliminary form) was there starting from at least v1.5, which was released at the end of 2002 (just before I left Rutgers), but it was intentionally undocumented. The source code of
find_pair
(as part of 3DNA) was tested and shared within Rutgers (NDB and Dr. Olson’s laboratory) before any 3DNA paper was published, and served as the basis for several other projects. We also offered 3DNA (with source code) to a few RNA experts for comments; but we received either no responses or politely-worded negative ones. Things did not work out as (what I thought) they should have been, but that’s life and I have learned my lessons. The “-p” option was first explicitly mentioned in the 3DNA 2008 Nature Protocols paper, to illustrate how to identify the two pentaplets in the large ribosomal subunit of H. marismortui (1jj2).
It is interesting to mention the two papers I’ve recently come across: the first is on DNA-protein interactions and the second on RNA base-pairing, where new algorithms were developed to detect base pairs and their performances were compared with find_pair
. In each of the two cases, it was claimed that find_pair
missed certain pairs where the new methods succeeded. As it turned out, however, in the first case, simply relaxing find_pair
’s default H-bond distance cut-off 4.0 Å to 4.5 Å, as used by the authors, virtually all the missing pairs were recovered. In the second case, the “-p” option, which should have been, was simply not specified.
After nearly a decade of extensive real-world applications and refinements, it is safe to say that find_pair is now a versatile and practical tool for nucleic acids structure analysis. Of course, I will continue to support and further refine find_pair
as I see fit. Once in a while, I just cannot stop but to think that find_pair
is to nucleic acids what DSSP
is to proteins: simple and elegant. As more people become aware of its existence, I would expect find_pair
to gain even more widespread usage, especially in RNA-structure related research areas.