[Job] Staff Associate II (Computational Structural Biology) at Columbia University

The X3DNA-DSSR resource is at the forefront of structural bioinformatics, developing advanced tools for analyzing and modeling nucleic acid structures. We are seeking a highly motivated Staff Associate II to join our team and contribute to our next-generation analysis and visualization engine.

To see our resource in action, please visit wDSSR, our new web interface for dissecting and modeling 3D nucleic acid structures: https://web.x3dna-dssr.org/.

We are looking for a candidate with a strong scientific background in structural biology or bioinformatics and a desire to contribute to peer-reviewed publications through community-driven data analysis. We value individuals who are eager to learn, adapt to new technical challenges, and support the global research community.

For the full job description and to submit your application, please visit the official Columbia University posting: https://apply.interfolio.com/183705

Announcing wDSSR: The Next-Generation Web Interface to X3DNA-DSSR

Dear 3DNA/DSSR Community,

We are thrilled to announce the official launch of wDSSR (https://web.x3dna-dssr.org/), the powerful new web interface to the X3DNA-DSSR analytical engine.

Developed by Drs. Shuxiang Li and Xiang-Jun Lu and supported by NIH grant R24GM153869, wDSSR represents a major leap forward from our highly popular 2019 Web 3DNA 2.0 framework. While Web 3DNA 2.0 has faithfully served the community for the analysis, visualization, and modeling of 3D nucleic acid structures, wDSSR was built from the ground up to take full advantage of modern web technologies and the latest DSSR backend capabilities.

A Modern, Streamlined Scientific Workflow We have completely overhauled the user interface to provide a clean, intuitive, and task-driven experience. The core modeling and analysis tools are now seamlessly organized into a logical, single-word scientific workflow: Analyze, Rebuild, Model, Circularize, Mutate, Assemble, and Visualize.

Spotlight Feature: The "Assemble" Module One of the most exciting upgrades is the newly renamed Assemble tab (formerly "Composite"). This advanced composite model builder allows you to effortlessly construct complex, higher-order models by linking any combination of nucleic acid duplexes or protein-DNA/RNA complexes. You can quickly connect up to six distinct target structures, ranging from simple linked A-DNA and B-DNA duplexes to large, protein-decorated structural assemblies.

Immediate Global Adoption Although wDSSR has just launched, we are incredibly humbled to share that it is already seeing rapid worldwide adoption! According to recent network infrastructure data, the new interface is actively being used by researchers across North America, South America, Europe, and Asia. Within just a few days, we have recorded active sessions from prestigious institutions around the globe, including:

The Weizmann Institute of Science in Israel
Katholieke Universiteit Leuven in Belgium
Queen's University in Canada
Universidad Nacional Autonoma de Mexico (UNAM) in Mexico
Emory University and the Wadsworth Centers Laboratories and Research in the United States
Jawaharlal Nehru University and the China Education and Research Network in Asia

How to Cite While a dedicated paper for wDSSR is currently in preparation, researchers should cite the server using its URL (https://web.x3dna-dssr.org/) alongside the 2019 Web 3DNA 2.0 paper and the foundational 2015 DSSR paper. Full details and funding acknowledgements can be found on our newly consolidated About page.

We invite you all to try out the new wDSSR platform! As always, your feedback is invaluable to us, and we encourage you to share your thoughts, questions, and structural models via the newly updated Questions & Feedback link in the wDSSR footer.

Happy modeling!

The article on G.A pairs in ACS Biochemistry

After many years of efforts, it is a great pleasure to see our paper Effects of Noncanonical Base Pairing on RNA Folding: Structural Context and Spatial Arrangements of G·A Pairs published in ACS Biochemistry. The abstract is shown below:

Noncanonical base pairs play important roles in assembling the three-dimensional structures critical to the diverse functions of RNA. These associations contribute to the looped segments that intersperse the canonical double-helical elements within folded, globular RNA molecules. They stitch together various structural elements, serve as recognition elements for other molecules, and act as sites of intrinsic stiffness or deformability. This work takes advantage of new software (DSSR) designed to streamline the analysis and annotation of RNA three-dimensional structures. The multiscale structural information gathered for individual molecules, combined with the growing number of unique, well-resolved RNA structures, makes it possible to examine the collective features deeply and to uncover previously unrecognized patterns of chain organization. Here we focus on a subset of noncanonical base pairs involving guanine and adenine and the links between their modes of association, secondary structural context, and contributions to tertiary folding. The rigorous descriptions of base-pair geometry that we employ facilitate characterization of recurrent geometric motifs and the structural settings in which these arrangements occur. Moreover, the numerical parameters hint at the natural motions of the interacting bases and the pathways likely to connect different spatial forms. We draw attention to higher-order multiplexes involving two or more G·A pairs and the roles these associations appear to play in bridging different secondary structural units. The collective data reveal pairing propensities in base organization, secondary structural context, and deformability and serve as a starting point for further multiscale investigations and/or simulations of RNA folding.

This work represents a multifaceted, fundamental application enabled by DSSR. Even at the base-pair (bp) level, DSSR provides unique features that complement the Leontis-Westhof (LW) notation of 12 geometric types.

At the review stage, we were asked by a referee to comment on the differences between DSSR and LW on bp classifications. The following paragraph in the “DISCUSSION” section of the paper is our response, expanded on the original writing that focused on DSSR’s capabilities:

Qualitative descriptions of noncanonical RNA base pairing, pioneered by Leontis and Westhof^9,41 and linked in this work to the rigid-body parameters of interacting bases, have proven valuable in deciphering the connections between RNA primary, secondary, and tertiary structures. The present categorization is based on the positions of the hydrogen-bonded atoms with respect to a standard, embedded base reference frame³⁰ defined in terms of an idealized Watson−Crick base pair. The major- and minor-groove base edges used here correspond in most cases to what are termed the Hoogsteen and sugar edges in the Leontis−Westhof scheme (one can compare the two classification schemes in Table S2). The + and − symbols introduced in 3DNA²⁴ and DSSR²⁷ unambiguously distinguish the relative orientations of the two bases. The trans and cis designations used in the earlier literature, however, are qualitative in nature and often uncertain. There are many “nc” (near cis, as in ncWW) and “nt” (near trans, as in ntSH) annotations listed in the RNA Structure Atlas; see, for example, the base-pair interactions in the sarcin−ricin domain of E. coli 23S rRNA found by entering PDB entry 1msy at http://rna.bgsu.edu/rna3dhub/pdb. The assignment of qualitative descriptors of RNA associations on the basis of atomic identity alone is generally not clear-cut. Numerical differences in the rigid-body parameters are critical to differentiating pairing schemes that share a common hydrogen bond, e.g., the G(N3)···A(N6) interaction found in m−W_{II and m−M_I arrangements of G and A (Table 1 and Figures 4 and S3). The numerical data also provide a basis for following conformational transitions and may potentially be of value in making functional and other meaningful distinctions among RNA base pairs.}

See also a recent thread Noncanonical base pair standards on the 3DNA Forum and the section titled “3.2.2 Base pairs” in the DSSR User Manual.

Comment

Web 3DNA 2.0 paper published in NAR

It is a great pleasure to announce the publication of Web 3DNA 2.0 for the analysis, visualization, and modeling of 3D nucleic acid structures in Nucleic Acids Research (NAR). The paper will appear in the web server issue of NAR in July 2019. At nine-page in length and with several new structural parameters, this w3DNA 2.0 paper is certainly not a typical NAR web-server publication. It represents a significant contribution to the field of 3D nucleic acids structural bioinformatics, and will undoubtedly push the popularity of 3DNA to a new level.

The abstract is shown below:

Web 3DNA (w3DNA) 2.0 is a significantly enhanced version of the widely used w3DNA server for the analysis, visualization, and modeling of 3D nucleic-acid-containing structures. Since its initial release in 2009, the w3DNA server has continuously served the community by making commonly-used features of the 3DNA suite of command-line programs readily accessible. However, due to the lack of updates, w3DNA has clearly shown its age in terms of modern web technologies and it has long lagged behind further developments of 3DNA per se. The w3DNA 2.0 server presented here overcomes all known shortcomings of w3DNA while maintaining its battle-tested characteristics. Technically, w3DNA 2.0 implements a simple and intuitive interface (with sensible defaults) for increased usability, and it complies with HTML5 web standards for broad accessibility. Featurewise, w3DNA 2.0 employs the most recent version of 3DNA, enhanced with many new functionalities, including: the automatic handling of modified nucleotides; a set of ‘simple’ base-pair and step parameters for qualitative characterization of non-Watson–Crick double- helical structures; new structural parameters that integrate the rigid base plane and the backbone phosphate group, the two nucleic acid components most reliably determined with X-ray crystallography; in silico base mutations that preserve the backbone geometry; and a notably improved module for building models of single-stranded RNA, double- helical DNA, Pauling triplex, G-quadruplex, or DNA structures ‘decorated’ with proteins. The w3DNA 2.0 server is freely available, without registration, at http://web.x3dna.org.

Moreover, details on reproducing our reported results are available in a dedicated section ‘web 3DNA 2.0 (http://web.x3dna.org)’ on the 3DNA Forum.

Figure 1. Summary of web 3DNA 2.0
Figure 2. New structural parameters
Figure 3. Fiber models and mutations
Supplemental PDF
Graphical abstract (taken from Fig. 1E):

Comment

3DNA blocview image in the cover of the RNA journal

While browsing the June 2019 issue of the RNA journal, I was surprised to see a cover image with familiar schematic representations:

The caption is as below:

Crystal structure of ykoY-mntP riboswitch chimera bound to cadmium (Protein Data Bank code: 6cc3; Bachas ST, Ferré-D’Amaré AR. 2018. Convergent use of heptacoordination for cation selectivity by RNA and protein metalloregulators. Cell Chem Biol 25: 962–973.e5). The RNA backbone is displayed as a red ribbon; bases are shown as blocks with NDB coloring: A—red, C—yellow, G—green, U—cyan; cadmium ions are shown as red spheres. The image was generated using 3DNA/blocview and PyMol software. Cover image provided by the Nucleic Acid Database (ndbserver.rutgers.edu).

In addition to the blocview script distributed with 3DNA v2.x, the block-view has been integrated into DSSR via the --blocview option. Notably, the DSSR-plugin introduces the dssr_block command to PyMOL for interactive visualization of nucleic acid structures. See the DSSR User Manual for more information.

Comment

DSSR on PDB entry 6neb

Via PDB weekly update, I recently came across PDB entry 6neb, which is solved by NMR and described as an “MYC promoter G-quadruplex with 1:6:1 loop length”. I downloaded the atomic coordinates of the entry and ran DSSR on it. Indeed, DSSR readily identifies a three-layered parallel G-quadruplex (G4) with three propeller-type loops of 1, 6 and 1 nucleotides (i.e., 1:6:1), as shown below.

List of 1 G4-stem
  Note: a G4-stem is defined as a G4-helix with backbone connectivity.
        Bulges are also allowed along each of the four strands.
  stem#1[#1] layers=3 INTRA-molecular loops=3 descriptor=3(-P-P-P) note=parallel(4+0) UUUU parallel
   1  glyco-bond=---- groove=---- WC-->Major nts=4 GGGG A.DG3,A.DG7,A.DG16,A.DG20
      pm(>>,forward)  area=14.54 rise=3.36 twist=24.7
   2  glyco-bond=---- groove=---- WC-->Major nts=4 GGGG A.DG4,A.DG8,A.DG17,A.DG21
      pm(>>,forward)  area=9.67  rise=3.48 twist=30.3
   3  glyco-bond=---- groove=---- WC-->Major nts=4 GGGG A.DG5,A.DG9,A.DG18,A.DG22
    strand#1  U DNA glyco-bond=--- nts=3 GGG A.DG3,A.DG4,A.DG5
    strand#2  U DNA glyco-bond=--- nts=3 GGG A.DG7,A.DG8,A.DG9
    strand#3  U DNA glyco-bond=--- nts=3 GGG A.DG16,A.DG17,A.DG18
    strand#4  U DNA glyco-bond=--- nts=3 GGG A.DG20,A.DG21,A.DG22
    loop#1 type=propeller strands=[#1,#2] nts=1 A A.DA6
    loop#2 type=propeller strands=[#2,#3] nts=6 TTTTAA A.DT10,A.DT11,A.DT12,A.DT13,A.DA14,A.DA15
    loop#3 type=propeller strands=[#3,#4] nts=1 T A.DT19

I then read the associated paper titled Solution Structure of a MYC Promoter G-Quadruplex with 1:6:1 Loop Length lately published in the new, open-access ACS Omega journal. The reported structure 6neb has a 27-nt sequence (termed Myc1245) of bases 5’-TTGGGGAGGGTTTTAAGGGTGGGGAAT-3’. Myc1245 is based on the 27-nt long, purine-rich MycPu27 which has 5 tracts of guanines of G4-forming motif within the MYC promoter. In Myc1245, the third G-tract of MycPu27 has been replaced by TTTA, thus it uses only G-tracts 1, 2, 4, 5 for G4 formation. Previously, it was shown that Myc2345 (using G-tracts 2-5 of MycPu27) adopts a parallel G4 structure with three propeller loops of 1:2:1 nt length.

The MycPu27 sequence is representative of the G4-forming nuclease hypersensitive element (NHE III1) within the promoter region of the MYC oncogene. Formation of G4 structures suppresses MYC transcription, thus ligand-induced G4 stabilization in the DNA level is a promising strategy for cancer therapy. The NHE III1 motif can fold into multiple G4 structures depending on factors such as protein binding. The paper on 6neb illustrates that nucleolin, a protein shown to bind MYC G4 and repress MYC transcription, preferably binds the 1:6:1 loop length conformer than the 1:2:1 conformer (the major form under physiological conditions).

The DSSR analysis of 6neb shows that the two G-tetrad steps have different overlapping areas and twist angles. The top step comprising G3 and G4 (Fig. 1A) has better stacking interactions (14.5 Å²) and smaller twist (25º) than the bottom step containing G4 and G5 (9.7 Å² and 30º, respectively).

area=14.54 rise=3.36 twist=24.7
area=9.67  rise=3.48 twist=30.3

The analysis characterizes the T1–A15 pair as a reverse Hoogsteen pair (rHoogsteen), which is distinct from the T+A Hoogsteen pair. In DSSR, the rHoogsteen pair is of type M–N (anti-parallel), whilst the Hoogsteen pair is of type M+N (parallel). With the local base-reference frames attached (Fig. 1B), it is easy to visualize that the z-axis of T1 is pointing out of the base-pair plane, and the z-axis of A15 is pointing inwards. See also my blog post Hoogsteen and reverse Hoogsteen base pairs.

Fig. 1C shows the ATG-triad automatically identified by DSSR. As is clear in Fig. 1A, the ATG-triad stacks on the 3-layered G4 structure on the 5’ side. Moreover, with color-coded base blocks (G in green, T in blue, and A in red), the two stacks (T10–T11, and T12–T13–A14) in the 6-nt central propeller loop is immediately obvious.

Figure 1. DSSR-derived structural features in PDB entry 6neb. The images were created using DSSR and PyMOL.

In the 6neb paper, the author stated that “The central loop of 6 nt connects the outer tetrads by spanning the G-core. A single nucleotide is the minimal length of this structural motif, so the five additional residues can significantly increase the loop’s conformational flexibility.” (p.2536) It is worth noting that in PDB entry 2m53, described in G-rich VEGF aptamer with locked and unlocked nucleic acid modifications exhibits a unique G-quadruplex fold, it was observed that:

An unprecedented all parallel-stranded monomeric G-quadruplex with three G-quartet planes exhibits several unique structural features. Five consecutive guanine residues are all involved in G-quartet formation and occupy positions in adjacent DNA strands, which are bridged with a no-residue propeller-type loop.

The G4 structure is polymorphic. It seems every imaginable or even unexpected form is possible, depending on the context.

Comment

Non-G base tetrads

In addition to the well-known G-tetrad serving as the building block of G-quadruplexes (G4), other types of homogeneous or heterogeneous base-tetrads are also possible. In DSSR, all these base tetrads are generally termed multiplets where three or more bases associate in a co-planar fashion via H-bonding interactions.

In the context of G4 structures, U-tetrads are the most common. Fig. 1A shows an example of U-tetrads in PDB entry 4rne reported in the paper titled Structural Variations and Solvent Structure of r(UGGGGU) Quadruplexes Stabilized by Sr²⁺ Ions.. In the structure (Fig. 1B), two terminal U-tetrads cap the six-layered G4 structure in the middle. The four U’s in the U-tetrad are paired in parallel orientation (i.e., U+U), just as the G+G pairs in the G-tetrad of G4 structures (Fig. 1C). On the other hand, there is only one H-bond (O4…N3) in the U+U pair of the U-tetrad, in contrast to the two H-bonds in the G+G pair of the G-tetrad (Fig. 1C). In the PDB entry 4rne, DSSR also detects two octads where the middle G-tetrad is surrounded by four U’s in anti-parallel orientation (G’s filled in green vs. U’s empty, see Fig. 1C for an example).

Similarly orientated C-tetrad (C+C pair, Fig. 1D) or A-tetrad (A+A pair, Fig. 1E) are also possible. PDB entry “6a85”, associated with the paper High-resolution DNA quadruplex structure containing all the A-, G-, C-, T-tetrads., reported a high-resolution crystal structure of sequence 5'-AGAGAGATGGGTGCGTT-3' which contains all the homogeneous A-, G-, C-, T-tetrads, and the heterogeneous A:T:A:T tetrads. As of this writing (Feb. 19, 2019), the status for the PDB entry “6a85” is still “HPUB” (‘processing complete, entry on hold until publication’) even though the paper was published several months ago. Using mutate_bases in 3DNA, I generated a C-tetrad and an A-tetrad as shown in Fig. 1D and 1E. As the U-tetrad (Fig. 1A), the C- and A-tetrads also have only one H-bond in their M+N type pairs. The G-tetrad, with two H-bonds in its connecting pairs, is more stable than the other homogeneous base tetrads, leading to wide-spread G4 structures.

In the homogeneous base tetrads shown in Fig.1A-E, pairs are of the parallel M+N type and the bases are associated via their Watson-Crick and major-groove (Hoogsteen) edges. Two canonical (WC or G—U wobble) pairs can also associate via their minor-groove edges, as seen in PDB entries 2hk4 and 2lsx. Fig. 1F gives an example with two G—U wobble pairs (of anti-parallel M—N type, filled U in blue vs. empty G) in PDB entry 2lsx reported in the paper titled A minimal i-motif stabilized by minor groove G:T:G:T tetrads..

Figure 1. Non-G base tetrads automatically identified or modeled by 3DNA-DSSR. The images were created using DSSR and PyMOL.

Comment

G-tetrad and pseudo G-tetrads

A G-quadruplex (G4) is composed of stacks of G-tetrad where four guanines form four G•G pairs in a circular, planar fashion. Specifically, the G•G pairs of the G-tetrad (see Fig. 1A below) in G4 are of type M+N according to 3DNA/DSSR: i.e., G+G with the local z-axes of pairing guanines in parallel. Moreover, the G+G pair is uniquely quantified by three base-pair parameters: Shear, Stretch, and Opening with mean values [+1.6 Å, +3.5 Å, –90º] or [–1.6 Å, –3.5 Å, +90º], corresponding to the cWH (cW+M) or cHW (cM+W) types of LW (DSSR) classifications, respectively. This pair is numbered VI in the list of 28 base pairs with two or more H-bonds between base atoms, compiled by Saenger.

In addition to the standard G-tetrad configuration as normally seen in G4 structures, a so-called pseudo-G-tetrad form (see Fig. 1B below) is reported in a 2013 paper titled Duplex-quadruplex motifs in a peculiar structural organization cooperatively contribute to thrombin binding of a DNA aptamer. (PDB entry 4i7y). In a 2017 publication from the same group, Through-bond effects in the ternary complexes of thrombin sandwiched by two DNA aptamers, another form of pseudo-G-tetrad (Fig. 1C) is found in PDB entries 5ew1 and 5ew2.

Clearly, pseudo-G-tetrads are very different from the normal G-tetrad, in terms of base pairing patterns. The G-tetrad is highly regular with the same type of G+G pairs, with the O6 atoms pointing to the middle of the circle. The two pseudo-G-tetrads are less regular, and they differ from each other as well, by flipping G12 from syn (Fig. 1B) to trans (Fig. 1C).

These distinctions stand out even more by filling the up-face (+z-axis outwards) of a guanine base in green while leaving the down-face (+z-axis inwards) empty (G5 in Fig. 1B, G5 and G12 in Fig. 1C). So in G-tetrad (Fig. 1A), all four guanines have their positive z-axis point towards the viewer, corresponding to all four G+G pairs. In one pseudo-G-tetrad (Fig. 1B), G5 has its positive z-axis pointing away from the viewer. So G5–G7 and G5–G16 pairs are of the M–N type. The other type of pseudo-G-tetrad (Fig. 1C) has the opposite orientation for G12. Finally, Fig. 1D shows schematically PDB entry 4i7y where the G-tetrad and a pseudo-G-tetrad are directly stacked, creating a two-layered pseudo-G-quadruplex.

Figure 1. (A) G-tetrad, (B-C) two types of pseudo-G-tetrads, and (D) the complex of a DNA-apatmer with thrombin. G-tetrads were automatically identified by 3DNA-DSSR. The images were created using DSSR and PyMOL.

Comment

Starting DNA or RNA structures

A starting structure of suitable sequence is a prerequisite for many applications, including downstream use in X-ray crystallography, NMR, and molecular dynamics (MD) simulations. Browsing through the literature, I’ve noticed the following tools for such a purpose.

The make-na server, a web-based automated tool for making nucleic acid helices powered by NAB. It supports abasic sites via the underscore character (_). According to the help page, “The structure file represents the abasic as the 3-letter code ‘3DR’ in DNA strands and ‘ N’ in RNA strands. These are Protein Data Bank conventions.” An example input is shown below:

    ATACCGATACG_TAGAC
      TG_CTATGCTATCTGT_

The NAB itself, and the standalone fd_helix.c program which supports 6 fiber-based models of DNA or RNA.

The NUCGEN program from the Bansal group. “The NUCGEN software generates double helical models with the backbone fixed in B-form DNA, but with appropriate modifications in the input data, it can also generate A-form DNA and RNA duplex structures.”

3DNA and its web interface. The ‘rebuild’ program can be used for constructing customized, single or duplex DNA/RNA structures based on a set of base-pair and step (helical) parameters. Moreover, the sugar-phosphate backbone in A-, B- or RNA conformation is allowed. The ‘fiber’ program incorporates a comprehensive list of 56 regular models, based mostly on fiber diffraction data. The list includes single, duplex, triplex, quadruplex, DNA, RNA structures or their hybrids. Notable, the classic Pauling’s triplex model is also available. The 3DNA web 2.0 makes these model-building features readily accessible to a large user base.

Overall, each of the tools listed above has its unique features and may fit better for different applications. It is to the benefit of the user community to have a choice.

Comment

misc tips and tricks

Nowadays, I’ve been used to google searches as a quick way to solve problems. Once in a while, I come across a tip or trick that fixes an issue at hand and then move on. However, I may late on meet a similar problem, but only vaguely remember how I solved it previously. So I’d need to start googling around again. This list is a remedy for such situations, and it will be continuously updated. While the list is created for my own reference, it may also be useful to other viewers of the post, presumably reaching here via google.

icdiff — show diff with color
scc — strip C comments
Taskwarrior (taks) — manage TODO list from terminal
httpie as a replacement of curl and wget
byebug and pry for debugging Ruby
ag to search for PATTERN in source files, replacing grep
fd to find files and directories
bat to view files with syntax highlighting (in place of cat)
exa as an alternative to ls
bench to benchmark code
asciinema and svg-term to record terminal activity as an SVG animation. Another option is termtosvg. Moreover, the trio ttyrec, ttyplay, ttygif can record, play terminal screen recordings, and convert it into smooth GIF
wrk to benchmark HTTP APIs
hub — git wrapper for GitHub
tail -n +2 to skip the first line (starting from the second line)
sudo -i -u user_id, the -i or --login option invokes login shell
Understanding Shell Script’s idiom: 2>&1 — redirect ‘stderr’ to ‘stdout’ via ’2>&1’ in bash shell.
Ruby one-liners
- ruby -pi.bak -e "gsub(/SOME_PATTERN/, 'other_text')" files for global replacement of SOME_PATTERN by other_text in files
- ruby -pe 'gsub(/_/, ".")' globally replace ‘_’ with ‘.’

Comment

Identification of nucleotides

Nucleic acids structural bioinformatics starts with the identification of nucleotides (nts) from atomic coordinates. As biopolymers, RNA and DNA have standard IUPAC names of atoms for the five bases (see the Figure below), sugars (ending with prime, e.g., C1’, O2’), and the phosphate (P, OP1, and OP2). The atomic coordinates (in PDB or mmCIF format) from the Protein Data Bank (PDB) follow the convention.

Trained as a chemist, I am aware that the bases are aromatic, heterocyclic compounds (purines and pyrimidines). Moreover, the five standard bases (A, C, G, T, and U) also share a six-membered ring, with atoms named consecutively (N1, C2, N3, C4, C5, C6). This special feature can be employed to identify nts automatically, from PDB atomic coordinates. The ring skeleton is not influenced by protonation states, tautomeric forms, or modifications in base, sugar or phosphate. Early versions of 3DNA (up to v2.0) used only N1, C2, and C6 atoms to identify an nt: an additional N9 as purine, otherwise as pyrimidine. In 3DNA v2.3 and DSSR, the procedure has been refined to take advantage of all available rings atoms. It is thus more robust against distortions and still works even when any of the N1, C2, C6, or N9 atoms are mutated or missing. This blog post provides further technical details on how the method works.

The template used to identify nts is a purine, with nine base ring atoms. Purine is chosen since it contains atoms of the six-membered ring and N7, C8, and N9. Its atomic coordinates in PDB format are shown below. The coordinates are taken from ‘G’ in the standard reference frame ($X3DNA/config/Atomic_G.pdb). Using ‘A’ as reference won’t make any difference since the RMSD between them is only 0.038 Å.

ATOM      1  N9    G A   1      -1.289   4.551   0.000  1.00  0.00           N
ATOM      2  C8    G A   1       0.023   4.962   0.000  1.00  0.00           C
ATOM      3  N7    G A   1       0.870   3.969   0.000  1.00  0.00           N
ATOM      4  C5    G A   1       0.071   2.833   0.000  1.00  0.00           C
ATOM      5  C6    G A   1       0.424   1.460   0.000  1.00  0.00           C
ATOM      6  N1    G A   1      -0.700   0.641   0.000  1.00  0.00           N
ATOM      7  C2    G A   1      -1.999   1.087   0.000  1.00  0.00           C
ATOM      8  N3    G A   1      -2.342   2.364   0.001  1.00  0.00           N
ATOM      9  C4    G A   1      -1.265   3.177   0.000  1.00  0.00           C

The nt-identification process begins with a mapping of at least three atoms in the purine, followed by a least-squares fit between corresponding atoms. For the five standard bases and most modified ones, the RMSD is normally less than 0.12 Å, as seen in the Figure below. Even the unsaturated dihydrouridine in tRNA has an RMSD of less than 0.25 Å: for the yeast phenylalanine tRNA (PDB id: e1ehz), for example, it is 0.205 Å for H2U-16, and 0.226 Å for H2U-17. DSSR uses a cutoff of 0.28 Å, covering essentially all nucleotides in the PDB. As an extreme case, the DA1 residue on chain T of PDB id 4ki4 has only three base atoms: N7, C8, and N9 (i.e., no atoms from the six-membered ring). With an RMSD of only 0.005 Å, DSSR still takes it as an nt, properly assigned as ‘A’.

Molecular dynamics (MD) simulations sometimes produce heavily distorted bases, which is over the default cutoff. Users may change the cutoff to a larger value to accommodate such unusual cases.

In addition to dihydrouridine, the above Figure also shows pseudouridine (PSU), 1-methyladenosine (1MA), 4-thiouridine (4SU), and the heavily modified YYG in tRNA. They are all easily identified using the same scheme. Since the nt-identification method concentrates on base rings, modifications in sugar or the phosphate group do not pose any problem. For example, in tRNA 1ehz, DSSR also identifies O2’-methylguanosine (OMG) and O2’-methylcytidine (OMC) as modified nts.

Two special cases worth mentioning. The ligand IMD in PDB id 1r8e has a five-membered ring. Its atoms are named similarly to those of an nt, and the fitted RMSD is only 0.29 Å. IMD can be filtered out by its missing of the C6 atom and having an N1—C5 covalent bond. The ligand SPM in PDB id 355d is a linear molecule, and its RMSD (1.86 Å) is clearly far off to be taken as an nt.

Another particular case (of a different kind) is the abasic sites, especially in X-ray crystal structures in the PDB. By definition, abasic sites do not have base atoms available. Thus the described method is not applicable to their characterization as nts. As of v1.7.3-2017dec26, however, DSSR has also incorporated abasic sites into the analysis pipeline, by default. The program checks backbone linkage and residue name for appropriate nt assignment. The abasic sites could constitute part of (internal) loops which would otherwise be broken into segments by DSSR.

Overall, I feel confident to say that 3DNA-DSSR has practically solved the problem of identifying nts from atomic coordinates. The method detailed herein (and outlined in the DSSR paper) is simple and easy to understand/implement. Moreover, it has been extensively tested in real-world applications for well over a decade. I’ve yet to find a single case where it does not work as expected.

Comment

« Older · Newer »

Thank you for printing this article from http://x3dna.org/. Please do not forget to visit back for more 3DNA-related information. — Xiang-Jun Lu

X3DNA-DSSR: a resource for structural bioinformatics of nucleic acids(An NIGMS National Resource supported by NIH grant R24GM153869)

Announcing wDSSR: The Next-Generation Web Interface to X3DNA-DSSR

X3DNA-DSSR: a resource for structural bioinformatics of nucleic acids
(An NIGMS National Resource supported by NIH grant R24GM153869)