Duplicating an RNA duplex using DSSR

Background and motivation

In late 2021, I came across the thread titled "create a 26 bp RNA from a 13 bp system" on the PyMOL mailing list. The thread began with a user asking:

I have an RNA duplex with 13 base-pairs (attached). Is it possible to duplicate this system and then fuse the two molecules to create a 26 base-pair long system using the pymol.

The message is both concise and clear. The attached 13 base-pair RNA duplex (named model.pdb) makes the task easier to understand. An expert PyMOL user responded quickly, providing a set of suggested PyMOL commands along with warnings about the complexity of the task.

No, not automatically. Your RNA is very distorted from the standard A-form. I doubt any modeling program can accurately extend such a distorted helix. Maybe someone else will prove me wrong. ... You can align the terminal base pairs manually through a series of commands. If you try by dragging one copy relative to another, you will wind up pulling out all of your hair. The commands and patience will keep you out of the mad house.

DSSR offers unique capabilities to automatically manipulate nucleic acid structures. It also enables the duplication of an RNA duplex, as specifically requested by the original poster. In my initial response to the thread, I provided a DSSR-based solution for duplicating the RNA duplex without detailed explanations, aiming to confirm whether the result met the user's needs. The feedback was positive, as indicated below:

Thanks for proving me wrong. Congratulations on your duplicated model! Please share the commands that you used with DSSR to generate the duplicated helix. --- from the PyMOL responder


Thanks a lot for your help. The model you have duplicated is exactly what I am looking for (checked it with VMD). Unfortunately I do not have access to DSSR-Pro. Is there any way that I can reproduce your procedure with x3dna-dssr? I need to create different numbers of duplicates (2,4,6,5,8) for different systems and this will be very helpful. --- from the original poster

During that period (near the end of 2021), I was facing a funding gap. To address this challenge, we decided to license DSSR through Columbia Technology Ventures (CTV) and introduced a Pro version of DSSR for commercial users and academic institutions, providing advanced modeling features and dedicated support. Note that a DSSR Pro Academic license entails a one-time fee of $1,020. The software can be installed on Windows, macOS, or Linux. Although it is not explicitly included in the license agreement, I provide direct support to Pro license users via email, phone, or Zoom, whichever is most convenient, to help address their issues. I care about user experience, especially for those who invest in the Pro version.

Following user feedback, I shared detailed instructions on duplicating an RNA duplex using DSSR Pro. I am grateful that the original poster purchased a DSSR Pro Academic license and successfully duplicated the RNA helix. Later, we communicated via email about other related tasks. This experience underscored the importance of engaging with the scientific community and addressing user needs to drive software development and adoption.

Detailed instructions

With funding from grant R24GM153869, I have transferred many DSSR Pro features into the free DSSR Academic version to better serve the scientific community. Below is a step-by-step command script for duplicating an RNA duplex using either DSSR Pro or the free DSSR Academic v2.5.2. The script runs instantaneously in a terminal window.

x3dna-dssr tasks -i=model.pdb --frame-pair=last -o=model1-ref-last.pdb

x3dna-dssr fiber --seq=GG --rna-duplex -o=conn.pdb
x3dna-dssr tasks -i=conn.pdb --frame-pair=first --remove-pair -o=ref-conn.pdb

x3dna-dssr tasks --merge-file='model1-ref-last.pdb ref-conn.pdb' -o=temp1.pdb

x3dna-dssr tasks -i=temp1.pdb --frame-pair=last --remove-pair -o=temp2.pdb
x3dna-dssr tasks -i=model.pdb --frame-pair=first -o=model1-ref-first.pdb

x3dna-dssr tasks --merge-file='temp2.pdb model1-ref-first.pdb' -o=duplicate-model.pdb

x3dna-dssr --order-residue -i=duplicate-model.pdb -o=temp3.pdb
x3dna-dssr --renumber-residue -i=temp3.pdb -o=temp4.pdb
x3dna-dssr --connect-file -i=temp4.pdb -o=RNA-duplicate.pdb

The procedure is essentially the same as the one used in "Building extended Z-DNA structures with backbones using DSSR". For completeness, I have included detailed explanations for each step here as well.

  1. Setting Up the Reference Frame:

    • The first command places the 13 base-pair RNA duplex (model.pdb) into the reference frame of its last base pair, resulting in model1-ref-last.pdb.
  2. Creating the Fiber Connector:

    • The fiber model is constructed using an RNA duplex (--rna-duplex) with the sequence GG on the leading strand (conn.pdb). This connector is oriented into the reference frame of its first base pair.
    • The first base pair is removed. Thus, the resulting coordinate file, ref-conn.pdb, contains only one pair.
    • Note: The sequence GG serves as a placeholder. It can be replaced with any other two bases: for instance, changing --seq=GG to --seq=AA. Moreover, using --seq=GA10G allows for creating a linker with 10 adenines.
  3. Merging PDB Files:

    • The two PDB files, model1-ref-last.pdb and ref-conn.pdb, share a common reference frame and are merged into a single file named temp1.pdb.
  4. Adjusting the Reference Frame:

    • The merged file (temp1.pdb) is then aligned with the last base pair, which is subsequently removed to produce temp2.pdb. This completes the role of the GG fiber connector.
  5. Reorienting the RNA Duplex:

    • The original 13 base-pair RNA duplex (model.pdb) is reoriented into the reference frame of its first base pair, generating model1-ref-first.pdb.
  6. Final Merging:

    • The two PDB files, temp2.pdb and model1-ref-first.pdb, contain identical 13 base-pair RNA duplexes but in different orientations. They are merged into a single file (duplicate-model.pdb), establishing the final duplicated RNA structure.
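  7. Bookkeeping for Visualization:

    • The last three commands (--order-residue, --renumber-residue, and --connect-file) are bookkeeping steps: they ensure proper order and numbering of nucleotides along each chain and generate CONECT records for smooth viewing in PyMOL, producing the final file RNA-duplicate.pdb.
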
The duplicated RNA helix is illustrated in the image below.

duplicated RNA helix
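For scripted use, the ten commands listed above can be assembled programmatically. The Python sketch below is only an illustration and is not part of DSSR: the function names build_duplicate_commands and run_duplicate and their parameters are hypothetical, and the sketch uses only the x3dna-dssr options that appear in the script above. It assumes x3dna-dssr is on the PATH when the commands are actually executed; otherwise it just prints them.

```python
import shlex
import shutil
import subprocess


def build_duplicate_commands(model="model.pdb", seq="GG", out="RNA-duplicate.pdb"):
    """Return the ten x3dna-dssr invocations from the script above as argv lists."""
    exe = "x3dna-dssr"
    return [
        [exe, "tasks", f"-i={model}", "--frame-pair=last", "-o=model1-ref-last.pdb"],
        [exe, "fiber", f"--seq={seq}", "--rna-duplex", "-o=conn.pdb"],
        [exe, "tasks", "-i=conn.pdb", "--frame-pair=first", "--remove-pair", "-o=ref-conn.pdb"],
        [exe, "tasks", "--merge-file=model1-ref-last.pdb ref-conn.pdb", "-o=temp1.pdb"],
        [exe, "tasks", "-i=temp1.pdb", "--frame-pair=last", "--remove-pair", "-o=temp2.pdb"],
        [exe, "tasks", f"-i={model}", "--frame-pair=first", "-o=model1-ref-first.pdb"],
        [exe, "tasks", "--merge-file=temp2.pdb model1-ref-first.pdb", "-o=duplicate-model.pdb"],
        [exe, "--order-residue", "-i=duplicate-model.pdb", "-o=temp3.pdb"],
        [exe, "--renumber-residue", "-i=temp3.pdb", "-o=temp4.pdb"],
        [exe, "--connect-file", "-i=temp4.pdb", f"-o={out}"],
    ]


def run_duplicate(model="model.pdb", seq="GG", out="RNA-duplicate.pdb"):
    """Print each command; execute it only if x3dna-dssr is installed."""
    have_dssr = shutil.which("x3dna-dssr") is not None
    for argv in build_duplicate_commands(model, seq, out):
        print(shlex.join(argv))  # shell-quoted form, e.g. for copy and paste
        if have_dssr:
            subprocess.run(argv, check=True)
```

Passing seq="AA" corresponds to the placeholder substitution mentioned in step 2. Passing the argument list directly to subprocess.run avoids any shell quoting issues with the space-separated --merge-file value.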

Some caveats

The original 13 base-pair RNA duplex (model.pdb) contains three main PDB format inconsistencies:

  • Missing Chain Identifiers: The two strands lack proper chain identifiers in column 22.
  • Incorrect Covalent Bond Distance: Nucleotides RU25 and RC26 are not covalently linked. Specifically, the distance between O3' of RU25 and P of RC26 is 3.5 Å, exceeding the expected 1.6 Å for a proper covalent bond.
  • Misclassified Ligand Record: The ligand (LIG27) is incorrectly designated as ATOM instead of the appropriate HETATM record.


---

Building extended Z-DNA structures with backbones using DSSR

A recent thread on the 3DNA Forum discussed 'Rebuilding Z-DNA' by extending an existing structure. The 3DNA rebuild program allows users to generate DNA or RNA structures with any user-specified sequence and corresponding base-pair/step parameters. This process is rigorous for atomic coordinates of base (and C1') atoms: running analyze on the rebuilt structure will yield the same set of parameters that users initially input. For more details, see the 2003 3DNA paper, the 2015 DSSR paper, and the DSSR User Manual.

The challenge lies in modeling the backbones. For right-handed A- or B-form DNA, users can build full-atom models with canonical backbone conformations, that is, C3'-endo or C2'-endo sugar puckers and anti glycosidic bonds. However, left-handed Z-DNA has unique structural features (such as syn-G, and CpG and GpC dinucleotides as building blocks instead of single nucleotides) that are not fully addressed by the 3DNA rebuild program.

DSSR (Pro version or the Academic v2.5.2) offers a solution by providing tools to build extended Z-DNA structures with proper backbones. The commands are as follows:

x3dna-dssr -i=1qbj.pdb1 --select-chains='D E' --delete-water -o=model.pdb

x3dna-dssr tasks -i=model.pdb --frame-pair=last -o=model1-ref-last.pdb

# poly d(GC) : poly d(GC)
x3dna-dssr fiber --z-dna --repeat=1 -o=conn.pdb
x3dna-dssr tasks -i=conn.pdb --frame-pair=first --remove-pair -o=ref-conn.pdb

x3dna-dssr tasks --merge-file='model1-ref-last.pdb ref-conn.pdb' -o=temp1.pdb

x3dna-dssr tasks -i=temp1.pdb --frame-pair=last --remove-pair -o=temp2.pdb
x3dna-dssr tasks -i=model.pdb --frame-pair=first -o=model1-ref-first.pdb

x3dna-dssr tasks --merge-file='temp2.pdb model1-ref-first.pdb' -o=duplicate-model.pdb

x3dna-dssr --order-residue -i=duplicate-model.pdb -o=temp3.pdb --po-bond=3.6
x3dna-dssr --renumber-residue -i=temp3.pdb -o=temp4.pdb
x3dna-dssr --connect-file -i=temp4.pdb -o=1qbj-duplicate.pdb --po-bond=3.6

The logic behind these commands is very straightforward, but technical details may look a bit complex for the uninitiated:

  1. The first command extracts the Z-DNA duplex consisting of chains D and E from the first biological unit of PDB entry 1qbj (1qbj.pdb1) and removes water molecules, writing model.pdb. The Z-DNA duplex has the sequence CGCGCG/CGCGCG.
  2. The next command sets the Z-DNA duplex (model.pdb) into the reference frame of the last base pair, i.e., G-C (model1-ref-last.pdb).
  3. The fiber model consists of the GpC dinucleotide step (conn.pdb), which is then set into the reference frame of the first base pair (G-C). The first G-C pair is then removed, so the resulting coordinate file (ref-conn.pdb) contains only one C-G pair.
  4. The two PDB files, model1-ref-last.pdb and ref-conn.pdb, share a common reference frame and are merged into a single PDB file (temp1.pdb).
  5. The merged PDB file (temp1.pdb) is then set into the reference frame of the last base pair (i.e., C-G), which is removed from the resulting coordinate file (temp2.pdb). Now the job of the GpC fiber connector is done.
  6. The Z-DNA duplex (model.pdb) is once again set into the reference frame of the first base pair (i.e., C-G), leading to the coordinate file model1-ref-first.pdb.
  7. The two PDB files, temp2.pdb and model1-ref-first.pdb, both consist of the same Z-DNA duplex but in different orientations. They now share a common reference frame and are merged into the extended Z-DNA duplex (duplicate-model.pdb).
  8. The last three commands (with options --order-residue, --renumber-residue, and --connect-file) are bookkeeping steps that ensure proper order and numbering of nucleotides along each chain and generate CONECT records for smooth viewing in PyMOL. They produce the final file, 1qbj-duplicate.pdb.

The final PDB coordinate file (1qbj-duplicate.pdb) can be downloaded, and visualized in DSSR-enabled cartoon-block representation as below:

Extended Z-DNA generated with DSSR


---

Mapping of modified nucleotides in DSSR

On January 29, 2025, I received the following email request from a long-time DSSR user:

... recently noted that 3DNA/DSSR automatically maps non-standard nucleotides to standard nucleotides. I wonder if you would be willing to share with us your most current version of mappings?

I responded to the user the same day, with detailed information about the mapping process in DSSR. The user was happy with my response, and that thread was quickly closed with a positive note.

On April 22, 2025, a related question, titled "Can x3dna-dssr correctly handle N1-methyl-pseudouridine?", was asked on the 3DNA Forum. In answering the question on the Forum, I referred to my email response to the previous user.

I now realize that writing a detailed blog post explaining the mapping process would be beneficial for DSSR users. It would also enable me to easily reference this blog post in future interactions with users.


3DNA/DSSR performs automatic mapping of modified nucleotides (including pseudouridine) to their standard counterparts. Over the years, the method has proven to work well in real-world applications. It is one of the defining features that make DSSR just work. For example, for the tRNA 1ehz, DSSR automatically identifies the following 14 modified nucleotides (of 11 unique types):

# x3dna-dssr -i=1ehz.pdb
List of 11 types of 14 modified nucleotides
      nt    count  list
   1 1MA-a    1    A.1MA58
   2 2MG-g    1    A.2MG10
   3 5MC-c    2    A.5MC40,A.5MC49
   4 5MU-t    1    A.5MU54
   5 7MG-g    1    A.7MG46
   6 H2U-u    2    A.H2U16,A.H2U17
   7 M2G-g    1    A.M2G26
   8 OMC-c    1    A.OMC32
   9 OMG-g    1    A.OMG34
  10 PSU-P    2    A.PSU39,A.PSU55
  11 YYG-g    1    A.YYG37

Users could run DSSR on a set of structures of interest, and collect the unique mappings for a complete list of modified nucleotides.

Moreover, DSSR has the --nt-mapping option that allows users to control the mapping process. The screenshot below is taken from the relevant part of the DSSR manual.

For example, DSSR automatically maps 5MU (5-methyluridine 5′-monophosphate) to t (i.e., modified thymine) because of the 5-methyl group. With the option --nt-mapping='5MU:u', DSSR would take 5MU as a modified uracil. This option allows for multiple mappings separated by commas. The mapping of 5MU to u or t is obviously arbitrary. DSSR is robust against the ambiguity in assigning a modified nucleotide to its nearest canonical counterpart. For example, mapping 5MU to u or t has minimal influence on DSSR-derived base-pair parameters and other structural features.

DSSR nt-mapping option

Background information on the mapping

Over the years, I've refined the heuristics of the mapping process. In the early days of 3DNA, I kept an ever-increasing list in the file baselist.dat, with hundreds of entries such as "MIA a", which maps MIA to a modified A (denoted by lowercase a). In recent releases of DSSR, I keep only the standard ones, 48 entries in total, such as "ADE A" and "DG5 G". DSSR first performs filtering to decide whether a residue is a nucleotide and, if so, whether it is R (purine) or Y (pyrimidine). If a nucleotide is not a standard one, the following C function is called to do the mapping.

static void derive_new_nt_std_name(long resi, struct_mol *pdb, char *info)  
{  
    char str[BUF512];  
    double d1 = DMAX, d2 = DMAX;  
    long C1_prime, N1, C5;  
    struct_residue *r = &pdb->residues[resi];  

    if (r->type[RESIDUE_NT_UNKNOWN]) {  
        sprintf(r->std_name, "__%c", Gvars.abasic);  
        return;  
    }  

    if (is_R(resi, pdb)) {  /* purine */  
        if (residue_has_atom(" O6 ", resi, pdb))  /* with ' O6 ' */  
            strcpy(r->std_name, "__g");  
        else if (!residue_has_atom(" N6 ", resi, pdb) &&  /* no ' N6 ' but ' N2 ' */  
                 residue_has_atom(" N2 ", resi, pdb))  
            strcpy(r->std_name, "__g");  
        else  
            strcpy(r->std_name, "__a");  

    } else {  /* a pyrimidine */  
        if (residue_has_atom(" N4 ", resi, pdb))  
            strcpy(r->std_name, "__c");  
        else if (residue_has_atom(" C7 ", resi, pdb))  
            strcpy(r->std_name, "__t");  
        else  
            strcpy(r->std_name, "__u");  

        C1_prime = find_atom_in_residue(" C1'", resi, pdb);  
        N1 = find_atom_in_residue(" N1 ", resi, pdb);  
        if (atoms_same_model_chain_altloc(C1_prime, N1, pdb))  
            d1 = dist_atoms(C1_prime, N1, pdb);  

        if (!dval_in_range(d1, 1.0, 2.0)) {  
            C5 = find_atom_in_residue(" C5 ", resi, pdb);  
            if (atoms_same_model_chain_altloc(C1_prime, C5, pdb))  
                d2 = dist_atoms(C1_prime, C5, pdb);  
            if (dval_in_range(d2, 1.0, 2.0))  
                strcpy(r->std_name, "__p");  
        }  
    }  

    if (!Gvars.standalone) {  
        sprintf(str, "\n\tmatched nucleotide '%s' to '%c' for %s\n"  
                "\tverify and add an entry in <baselist.dat>\n",  
                r->res_name, r->std_name[2], info);  
        logit(str);  
    }  
}
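
For readers who prefer not to read C, the decision logic of the function above can be restated in Python. This is a simplified sketch, not DSSR code: the function name guess_standard_base and its arguments are hypothetical, the purine/pyrimidine classification is taken as an input, the branch for unknown (abasic) residues is omitted, and the 1.0 to 2.0 Å bond-distance range is assumed to be inclusive.

```python
def guess_standard_base(atom_names, is_purine, d_c1p_n1=None, d_c1p_c5=None):
    """Return the lowercase one-letter code chosen by the heuristic above.

    atom_names : set of stripped atom names in the residue, e.g. {"N1", "O6"}
    is_purine  : result of the separate R/Y classification (not shown here)
    d_c1p_n1   : C1'-N1 distance in angstroms, or None if unavailable
    d_c1p_c5   : C1'-C5 distance in angstroms, or None if unavailable
    """
    def bonded(distance):
        # Assumption: the range check is inclusive at both ends.
        return distance is not None and 1.0 <= distance <= 2.0

    if is_purine:
        if "O6" in atom_names:
            return "g"
        if "N6" not in atom_names and "N2" in atom_names:
            return "g"
        return "a"

    if "N4" in atom_names:
        code = "c"
    elif "C7" in atom_names:
        code = "t"
    else:
        code = "u"
    # Pseudouridine: the sugar is attached at C5 rather than N1.
    if not bonded(d_c1p_n1) and bonded(d_c1p_c5):
        code = "p"
    return code
```

For instance, a pyrimidine with a C7 atom and a normal C1'-N1 bond (as in 5MU) comes out as t, matching the 5MU-t entry in the 1ehz listing.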


---

About alternate location (altloc) field

The legacy PDB format has a field called “altLoc” (alternate location indicator) for "ATOM/HETATM" records in the "Coordinate Section". The corresponding documentation is excerpted below:

COLUMNS        DATA  TYPE    FIELD        DEFINITION
-----------------------------------------------------------------------
17             Character     altLoc       Alternate location indicator.
  • AltLoc is the place holder to indicate alternate conformation. The alternate conformation can be in the entire polymer chain, or several residues or partial residue (several atoms within one residue). If an atom is provided in more than one position, then a non-blank alternate location indicator must be used for each of the atomic positions. Within a residue, all atoms that are associated with each other in a given conformation are assigned the same alternate position indicator. There are two ways of representing alternate conformation- either at atom level or at residue level (see examples).

  • For atoms that are in alternate sites indicated by the alternate site indicator, sorting of atoms in the ATOM/HETATM list uses the following general rules:
    • In the simple case that involves a few atoms or a few residues with alternate sites, the coordinates occur one after the other in the entry.
    • In the case of a large heterogen groups which are disordered, the atoms for each conformer are listed together.

In mmCIF format, AltLoc is under the atom_site category, with attribute name label_alt_id: i.e., labelled as _atom_site.label_alt_id. It is a required data item and appears in 43% of entries in the PDB.

In 3DNA and DSSR, AltLoc has a default value of "A1 ": an atom is taken into consideration if its AltLoc field (a single character) is space, A, or 1, otherwise it is ignored. Note that for mmCIF format, AltLoc field with dot (.) or question mark (?) is taken as space. Customized AltLoc values can be set via the --altloc option in DSSR.
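
The selection rule just described can be sketched in a few lines of Python. This is an illustration only, not DSSR source code: the function name altloc_is_selected and its parameters are hypothetical, and it assumes that a custom --altloc value replaces the default non-blank indicators (A and 1) while blank remains accepted.

```python
def altloc_is_selected(altloc, allowed="A1", from_mmcif=False):
    """Return True if an atom with this AltLoc value should be kept.

    allowed lists the non-blank indicators to accept; blank is always accepted.
    """
    if from_mmcif and altloc in (".", "?"):
        altloc = " "  # mmCIF placeholders are treated as blank
    if altloc.strip() == "":
        return True
    return altloc in set(allowed)


if __name__ == "__main__":
    for value in (" ", "A", "1", "B"):
        print(repr(value), altloc_is_selected(value))
```

Converting allowed to a set makes the membership test match single characters only, so a multi-character value is never accepted by accident.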

Here is an example. PDB entry 7o1h is a 31-mer synthetic construct, with a hybrid-2R quadruplex-duplex of 3(-P-P-Lw) topology and three syn guanosines. It contains two modified residues designated BGM (BGM26 and BGM28), 8-bromo-2'-deoxyguanosine-5'-monophosphate, with AltLoc set to B. By default, DSSR detects only one G-tetrad, consisting of DG5,DG9,DG13,DG27, ignoring the two G-tetrads with BGM26 and BGM28. With the --altloc=B option (space is always included), all three G-tetrads are detected and the G-quadruplex (a G4-stem) is then automatically annotated as 3(-P-P-Lw):

# x3dna-dssr -i=7o1h-assembly1.cif --altloc=B

List of 2 types of 3 modified nucleotides
      nt    count  list
   1 BGM-g    2    A.BGM26,A.BGM28
   2 THM-t    1    A.THM1

List of 1 G4-stem
  Note: a G4-stem is defined as a G4-helix with backbone connectivity.
        Bulges are also allowed along each of the four strands.
  stem#1[#1] layers=3 INTRA-molecular loops=3 descriptor=3(-P-P-Lw) note=hybrid-2R(3+1) UUUD hybrid-(mixed)
   1  glyco-bond=---s sugar=---. groove=--wn WC-->Major nts=4 GGGg A.DG4,A.DG8,A.DG12,A.BGM28
   2  glyco-bond=---s sugar=--.- groove=--wn WC-->Major nts=4 GGGG A.DG5,A.DG9,A.DG13,A.DG27
   3  glyco-bond=---s sugar=---- groove=--wn WC-->Major nts=4 GGGg A.DG6,A.DG10,A.DG14,A.BGM26
    step#1  pm(>>,forward)  area=13.57 rise=3.39 twist=26.7
    step#2  pm(>>,forward)  area=12.00 rise=3.44 twist=28.4
    strand#1  U DNA glyco-bond=--- sugar=--- nts=3 GGG A.DG4,A.DG5,A.DG6
    strand#2  U DNA glyco-bond=--- sugar=--- nts=3 GGG A.DG8,A.DG9,A.DG10
    strand#3  U DNA glyco-bond=--- sugar=-.- nts=3 GGG A.DG12,A.DG13,A.DG14
    strand#4  D DNA glyco-bond=sss sugar=.-- nts=3 gGg A.BGM28,A.DG27,A.BGM26
    loop#1 type=propeller strands=[#1,#2] nts=1 T A.DT7
    loop#2 type=propeller strands=[#2,#3] nts=1 T A.DT11
    loop#3 type=lateral   strands=[#3,#4] nts=11 ACGCGCAGCGT A.DA15,A.DC16,A.DG17,A.DC18,A.DG19,A.DC20,A.DA21,A.DG22,A.DC23,A.DG24,A.DT25

See G4.x3dna.org for DSSR-enabled annotation and visualization of this G4 structure. Here is the G4-stem in the frame of reference of 5' DG4 (bottom right), following the convention of Dvorkin et al. (2018). It is orientated automatically based on the standard base-reference frame (Olson et al. (2001)) of DG4.

PDB entry 7o1h

References:

  • Dvorkin, Scarlett A., Andreas I. Karsisiotis, and Mateus Webba Da Silva. 2018. “Encoding Canonical DNA Quadruplex Structure.” Science Advances 4 (8): eaat3007. https://doi.org/10.1126/sciadv.aat3007.
  • Olson, Wilma K, Manju Bansal, Stephen K Burley, Richard E Dickerson, Mark Gerstein, Stephen C Harvey, Udo Heinemann, et al. 2001. “A Standard Reference Frame for the Description of Nucleic Acid Base-Pair Geometry.” Journal of Molecular Biology 313 (1): 229–37. https://doi.org/10.1006/jmbi.2001.4987.


---

Over 4-char long chain identifiers in the PDB

DNA and RNA are biological macromolecules consisting of long chains of nucleotides. In PDB coordinate files, each DNA/RNA chain is assigned a unique identifier. For the legacy PDB format, the chain identifier is clearly defined as a single alphanumeric character. For the mmCIF format, the length of the chain identifier is flexible: it is normally up to 4 characters long, but assembly files can have chain identifiers longer than 4 characters (as of May 2022, see examples).

Recently, I was approached with the following bug report where DSSR v2.4-2021nov11 was used:

Processing file '8feo-assembly1.cif'

[i] '8feo-assembly1.cif' taken as in .cif format by file extension.

*** buffer overflow detected ***: terminated

Aborted

I ran a newer version of DSSR (including the current release v2.5.2-2025apr03) on 8feo-assembly1.cif without any issue, as shown below:

# x3dna-dssr -i=8feo-assembly1.cif -o=8feo.out
[i] '8feo-assembly1.cif' taken as in .cif format by file extension.

Processing file '8feo-assembly1.cif'
[w] chain id 'AAA-2' > 4 chars
...

# Excerpt from 8feo.out
    no. of DNA/RNA chains: 2 [AAA=16,AAA-2=16]
    no. of nucleotides:    32
...
List of 16 base pairs
     nt1            nt2            bp  name        Saenger   LW   DSSR
   1 AAA.A1         AAA-2.U16      A-U WC          20-XX     cWW  cW-W
   2 AAA.G2         AAA-2.C15      G-C WC          19-XIX    cWW  cW-W
   3 AAA.A3         AAA-2.U14      A-U WC          20-XX     cWW  cW-W

The message [w] chain id 'AAA-2' > 4 chars says that the chain identifier 'AAA-2' exceeds the 4-char limit.

In addition to 8feo, similar issues were also fixed for the related PDB entries 5a79, 6a7a, 8fep, 8feq, 7umk, and 4v3p. Note that 4v3p is a eukaryotic polyribosomal assembly, which takes several hours to run on a MacBook Pro with 32GB RAM.


Some background information on how DSSR handles chain identifiers for mmCIF format files

When mmCIF support was first added to DSSR in 2013, I hard-coded the chain identifier to 4 chars following the documentation. In early 2024, when running DSSR on weekly updated PDB entries, I noticed a core dump bug with PDB entry 8feo for its biological assembly 1. At that time, I was not aware of the update on mmCIF-Formatted Assembly Files and its expansion of chain identifiers for symmetry-related copies: with PDB assembly files, -# is appended to any chain that is generated by a symmetry operation. So if the base chain id has 3 chars (e.g., AAA), the symmetry related chain will have 5 chars (e.g., AAA-2).

That is the case for PDB entry 8feo: it has a chain with identifier AAA-2, which is symmetry-related to the asymmetric unit chain AAA. Since AAA-2 (5 chars long) exceeds the hard-coded 4-char limit, DSSR crashed (out of array boundary in C). After recognizing the issue, I increased the chain identifier limit in DSSR to 8 chars, more than enough for all current PDB entries. Moreover, DSSR now performs a sanity check of chain identifier length: it reports a diagnostic message, as shown above, for chains with identifiers over 4 chars (e.g., AAA-2), and automatically shortens identifiers that exceed the enlarged limit. DSSR is now more robust and user-friendly: it no longer simply crashes, but communicates helpful information about unusual cases to draw users' attention.
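
The check-and-shorten behavior described above can be sketched in Python as follows. This is a hedged illustration rather than DSSR's actual C implementation: the function name normalize_chain_id and the constant names are hypothetical, and it assumes that shortening means keeping the leading characters up to the enlarged limit.

```python
import sys

DOCUMENTED_LIMIT = 4  # mmCIF documentation: up to 4 characters
STORAGE_LIMIT = 8     # enlarged limit mentioned above


def normalize_chain_id(chain_id, log=sys.stderr):
    """Warn about chain ids over 4 chars and cut them at the storage limit."""
    if len(chain_id) > DOCUMENTED_LIMIT:
        print(f"[w] chain id '{chain_id}' > {DOCUMENTED_LIMIT} chars", file=log)
    # Assumption: identifiers beyond the storage limit keep their first 8 chars.
    return chain_id[:STORAGE_LIMIT]
```

The key design point is that an over-long identifier produces a warning and a bounded value instead of writing past a fixed-size buffer.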

Taking this opportunity, I have also proactively updated DSSR to support long atom names, residue names, and segment ids, in preparation for future id changes. Tracing issues to their root causes and fixing them systematically is a key part of what makes DSSR a reliable tool for structural bioinformatics. Tests have been added to the quality control infrastructure to ensure that all these new features work as expected.

Nowadays, the vast majority (over 90%) of users' questions about DSSR can be answered straight away simply because they have already been addressed in advance, as shown in the above example for long chain identifiers. I'm always on the lookout for issues reported on the 3DNA Forum or received via email, Zoom, or in person, and, more systematically, for issues found by running DSSR on weekly released PDB entries and on files uploaded to the web services (g4.x3dna.org and skmatic.x3dna.org). Every issue is an opportunity to further polish DSSR and make it better. Overall, users' feedback is invaluable to me: I take it as an asset instead of a burden.


Documentation on chain identifiers in PDB and mmCIF formats

PDB format

The Coordinate Section in the PDB format documentation contains the following for ATOM/HETATM records:

The ATOM records present the atomic coordinates for standard amino acids and nucleotides. ... Non-polymer chemical coordinates use the HETATM record type.

Record Format
COLUMNS        DATA  TYPE    FIELD        DEFINITION
-----------------------------------------------------------
22             Character     chainID      Chain identifier.
**Details**
- Non-blank alphanumerical character is used for chain identifier.

So the chain identifier in PDB format is a single alphanumeric character in column 22 of the ATOM/HETATM records.

mmCIF format

  • Large Structures Represented in mmCIF/PDBx: Chain identifiers of up to 4 characters are permitted. The PDB chain identifier corresponds to the "_atom_site.auth_asym_id" data item.

  • News item on Distributing PDBx/mmCIF-Formatted Assembly Files

  • Github repo on Sample assembly files in PDBx/mmCIF Format

    These updated PDBx/mmCIF format assembly files will include all symmetry generated copies of each chain within a single model, with distinct chain IDs assigned to each. Generation of distinct chain IDs in assembly files are based upon the following rules:

    • Chain IDs of the original chains from the atomic coordinate file will be retained (e.g., A)
    • Assign unique chain ID for each symmetry copy within a single model. Rules of chain ID assignments:
      • The applied index of the symmetry operator will be appended to the original chain ID separated by a dash (e.g., A-2, A-3, etc.)
      • If there are more than one type of symmetry operators applied to generate symmetry copy, a dash sign will be used between two operators (e.g., A-12-60, A-60-88, etc.)
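
The naming rules quoted above are easy to express in code. The Python sketch below is an illustration only: the function names assembly_chain_id and split_assembly_chain_id are hypothetical, and the splitting function assumes that the original chain ID itself contains no dash.

```python
def assembly_chain_id(base_id, operator_indices=()):
    """Build a chain id following the naming rules quoted above."""
    return "-".join([base_id, *map(str, operator_indices)])


def split_assembly_chain_id(chain_id):
    """Split 'AAA-2' into ('AAA', (2,)); original chains give an empty tuple."""
    base_id, *operators = chain_id.split("-")
    return base_id, tuple(int(op) for op in operators)


if __name__ == "__main__":
    for base, ops in (("A", ()), ("AAA", (2,)), ("A", (12, 60))):
        chain = assembly_chain_id(base, ops)
        print(chain, len(chain))
```

The lengths printed by the demo show how quickly these identifiers grow past the 4-character limit: a 3-char base with a single operator already needs 5 characters, and a 1-char base with two two-digit operators needs 7.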


---

Water-mediated base pairs

Recently, I came across the paper by Mitra et al. (2025) titled "RNAproDB: A Webserver and Interactive Database for Analyzing Protein-RNA Interactions." I am glad to notice that DSSR (Lu et al. 2015) has been cited extensively in this work, as follows:

As part of the processing pipeline, multiple software is run including DSSR^12^ (base-pairing geometries, protein–RNA hydrogen bonds, and RNA secondary structure), HBPLUS^17^ (hydrogen bonds involving water molecules), ... Leontis-Westhof^27^ base pair annotations (as computed by DSSR^12^) ... The structural elements (stems, loops, hairpins, junctions, etc.) are detected using DSSR^12^ and mapped to the partial projection layout (via averaging corresponding residue coordinates)... We explored the relative abundance of different standard nucleotides (A, C, G, and U) in helical vs. non-helical regions (as computed by DSSR^12^)...We quantified the propensity of base-pairing (as detected by DSSR^12^) between different RNA bases (Fig. 3D).

This is an impressive contribution on the characterization of protein-RNA interactions. Reading carefully through the paper and its supplemental PDF, I was intrigued by the following note on a water-mediated U-U base pair missed by DSSR.

Another important aspect to discuss is RNA–RNA water-mediated interactions^33,34^. ... One such example is the CUG repeat structure from PDB ID 7Y2B^35^ (Fig. S5A). The U/U mismatches in this structure are often unable to form direct hydrogen bonds (specifically, the central U/U mismatch forms no direct hydrogen bond). Therefore, DSSR^12^ does not classify it as a base pair. However, two water molecules form water-mediated hydrogen bonds between the two U bases. ...

While DSSR already takes water-mediated H-bonds into account internally when detecting base pairs, it still requires: (1) at least one direct H-bond between two base atoms or between a base atom and the backbone, and (2) a co-planar geometry between the two bases. The water-mediated U7-U7 pair in PDB entry 7Y2B does not fulfill condition (1): the minimal distance between the two U bases is 5 Å, which is far larger than a typical H-bonding distance. Therefore, DSSR did not classify it as a base pair.

Prompted by the observation of Mitra et al. (2025), I have added a new option (--pair-water) in the DSSR v2.5.1-2025mar19 release to allow for water-mediated base pairs to be detected. Using PDB entry 7Y2B as an example, the DSSR command and related base-pairs output are shown below.

# x3dna-dssr -i=7Y2B.pdb1 --symm --pair-water

List of 13 base pairs
     nt1            nt2            bp  name        Saenger   LW   DSSR
   1 1:S.U1         2:S.A13        U-A WC          20-XX     cWW  cW-W
   2 1:S.U2         2:S.A12        U-A WC          20-XX     cWW  cW-W
   3 1:S.C3         2:S.G11        C-G WC          19-XIX    cWW  cW-W
   4 1:S.U4         2:S.U10        U-U --          --        cWW  cW-W
   5 1:S.G5         2:S.C9         G-C WC          19-XIX    cWW  cW-W
   6 1:S.C6         2:S.G8         C-G WC          19-XIX    cWW  cW-W
   7 1:S.U7         2:S.U7         U-U Water       --        cWW  cW-W
   8 1:S.G8         2:S.C6         G-C WC          19-XIX    cWW  cW-W
   9 1:S.C9         2:S.G5         C-G WC          19-XIX    cWW  cW-W
  10 1:S.U10        2:S.U4         U-U --          --        cWW  cW-W
  11 1:S.G11        2:S.C3         G-C WC          19-XIX    cWW  cW-W
  12 1:S.A12        2:S.U2         A-U WC          20-XX     cWW  cW-W
  13 1:S.A13        2:S.U1         A-U WC          20-XX     cWW  cW-W

Base pair #7 is water-mediated, as shown in the molecular image below. Note that .pdb1 means biological unit 1, and the option --symm reads the two symmetry-related structures in the MODEL/ENDMDL delineated ensemble as a single structure. See the DSSR User Manual for more details.

Water-mediated U-U pair detected by DSSR in PDB entry 7Y2B. Red spheres represent water molecules.

References

  • Lu, Xiang-Jun, Harmen J. Bussemaker, and Wilma K. Olson. 2015. “DSSR: An Integrated Software Tool for Dissecting the Spatial Structure of RNA.” Nucleic Acids Research, July, gkv716. https://doi.org/10.1093/nar/gkv716.

  • Mitra, Raktim, Ari S. Cohen, Wei Yu Tang, Hirad Hosseini, Yongchan Hong, Helen M. Berman, and Remo Rohs. 2025. “RNAproDB: A Webserver and Interactive Database for Analyzing Protein-RNA Interactions.” Journal of Molecular Biology, February, 169012. https://doi.org/10.1016/j.jmb.2025.169012.


---

The --structure-title option for DSSR .ct output

DSSR produces RNA secondary structures in connect table (.ct) format. According to "RNAstructure Command Line Help: File Formats" (with slight editing):


CT File Format

A CT (Connectivity Table) file contains secondary structure information for a sequence. These files are saved with a CT extension. When entering a structure to calculate the free energy, the following format must be followed.

  1. Start of first line: number of bases in the sequence
  2. End of first line: title of the structure
  3. Each of the following lines provides information about a given base in the sequence. Each base has its own line, with these elements in order:
    • Base number: index n
    • Base (A, C, G, T, U, X)
    • Index n-1
    • Index n+1
    • Number of the base to which n is paired. No pairing is indicated by 0 (zero).
    • Natural numbering. RNAstructure ignores the actual value given in natural numbering, so it is easiest to repeat n here.

Using PDB entry 1msy as an example (see Figure 1 below):



Figure 1. 3D and 2D structures of PDB entry 1msy. (A) 3D schematic auto-created via the DSSR-PyMOL integration. The labeled residues follow PDB coordinates. (B) 2D diagram rendered with VARNA using DSSR-derived 2D structural information in the .ct format. This figure was annotated using Inkscape.


Run the following commands:

x3dna-dssr -i=1msy.pdb
cp dssr-2ndstrs.ct 1msy-dssr-default.ct

The file 1msy-dssr-default.ct has the following contents:

   27 ENERGY = 0.0 [1msy] -- secondary structure derived by DSSR
    1 U     0     2     0  2647
    2 G     1     3    26  2648
    3 C     2     4    25  2649
    4 U     3     5    24  2650
    5 C     4     6    23  2651
    6 C     5     7    22  2652
    7 U     6     8     0  2653
    8 A     7     9     0  2654
    9 G     8    10     0  2655
   10 U     9    11     0  2656
   11 A    10    12     0  2657
   12 C    11    13    17  2658
   13 G    12    14     0  2659
   14 U    13    15     0  2660
   15 A    14    16     0  2661
   16 A    15    17     0  2662
   17 G    16    18    12  2663
   18 G    17    19     0  2664
   19 A    18    20     0  2665
   20 C    19    21     0  2666
   21 C    20    22     0  2667
   22 G    21    23     6  2668
   23 G    22    24     5  2669
   24 A    23    25     4  2670
   25 G    24    26     3  2671
   26 U    25    27     2  2672
   27 G    26     0     0  2673

Here the first line contains 27 (the number of bases) and ENERGY = 0.0 [1msy] -- secondary structure derived by DSSR (the title). While RNAstructure ignores the actual values given in natural numbering, DSSR outputs the residue numbers of the nucleotides (e.g., U2647 and G2673) from the PDB file.
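The column layout described above can be read back with a few lines of Python. The following is a minimal sketch, not part of DSSR: the helper name parse_ct is made up for illustration, and it assumes a single structure per file with integer natural numbering (as in the 1msy output shown above).

```python
# Minimal CT reader. parse_ct is an illustrative helper, not part of DSSR.
CT_TEXT = """\
   27 ENERGY = 0.0 [1msy] -- secondary structure derived by DSSR
    1 U     0     2     0  2647
    2 G     1     3    26  2648
    3 C     2     4    25  2649
    4 U     3     5    24  2650
    5 C     4     6    23  2651
    6 C     5     7    22  2652
    7 U     6     8     0  2653
    8 A     7     9     0  2654
    9 G     8    10     0  2655
   10 U     9    11     0  2656
   11 A    10    12     0  2657
   12 C    11    13    17  2658
   13 G    12    14     0  2659
   14 U    13    15     0  2660
   15 A    14    16     0  2661
   16 A    15    17     0  2662
   17 G    16    18    12  2663
   18 G    17    19     0  2664
   19 A    18    20     0  2665
   20 C    19    21     0  2666
   21 C    20    22     0  2667
   22 G    21    23     6  2668
   23 G    22    24     5  2669
   24 A    23    25     4  2670
   25 G    24    26     3  2671
   26 U    25    27     2  2672
   27 G    26     0     0  2673
"""


def parse_ct(text):
    """Return (title, records) for a single-structure CT file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    head = lines[0].split(None, 1)          # "<count> <title...>"
    n_bases = int(head[0])
    title = head[1].strip() if len(head) > 1 else ""
    records = []
    for ln in lines[1:n_bases + 1]:
        # Ignore any trailing "# ..." annotation before splitting columns.
        cols = ln.split("#", 1)[0].split()
        idx, base, prev_i, next_i, pair, natural = cols[:6]
        records.append({
            "index": int(idx), "base": base,
            "prev": int(prev_i), "next": int(next_i),
            "pair": int(pair), "natural": int(natural),
        })
    if len(records) != n_bases:
        raise ValueError("header count does not match number of base lines")
    return title, records


title, records = parse_ct(CT_TEXT)
# Report each base pair once, as (5' index, 3' index).
pairs = sorted((r["index"], r["pair"]) for r in records if r["pair"] > r["index"])
print(title)
print(pairs)
```

The last column is kept as "natural" so that a record can be traced back to the PDB residue number, which is the information DSSR adds beyond what RNAstructure requires.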

With the DSSR option --structure-title (or --str-title, actually via regex "^-?-?str(ucture)?[-_]?title"), users can set the title for the derived .ct file, as shown below:

x3dna-dssr -i=1msy.pdb --structure-title='CT file derived from DSSR'
cp dssr-2ndstrs.ct 1msy-dssr-title.ct

   27 CT file derived from DSSR
    1 U     0     2     0  2647
    2 G     1     3    26  2648
......
   26 U    25    27     2  2672
   27 G    26     0     0  2673
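The option-name pattern quoted above can be tried out directly. This is only a sketch using Python's re module with the regex exactly as given in the text; how DSSR itself applies the pattern (case sensitivity, handling of the "=value" part) is an assumption here, and is_title_option is a made-up helper name.

```python
import re

# Pattern copied from the text; the way DSSR applies it internally
# is not documented here, so this check is only an approximation.
TITLE_OPTION = re.compile(r"^-?-?str(ucture)?[-_]?title")


def is_title_option(arg):
    """True if the option name (the part before '=') matches the pattern."""
    name = arg.split("=", 1)[0]
    return TITLE_OPTION.match(name) is not None


for arg in ["--structure-title=demo", "--str-title", "-str_title=", "strtitle", "--title"]:
    print(arg, is_title_option(arg))
```

The leading dashes, the "ucture" part, and the separator are all optional in the pattern, which is why several spellings are accepted while a bare --title is not.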

The title can also be removed by passing an empty string (i.e., --str-title=""), or simply --str-title or --str-title= with no value.

x3dna-dssr -i=1msy.pdb --structure-title=""
cp dssr-2ndstrs.ct 1msy-dssr-notitle.ct

   27
    1 U     0     2     0  2647
    2 G     1     3    26  2648
......

With the --more option, DSSR also outputs additional info that can be used to easily identify a nucleotide and its pairing partner.

x3dna-dssr -i=1msy.pdb --more --structure-title="1msy with extra info"
cp dssr-2ndstrs.ct 1msy-dssr-extra.ct

   27 1msy with extra info
    1 U     0     2     0  2647 # name=A.U2647
    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672
    3 C     2     4    25  2649 # name=A.C2649, pairedNt=A.G2671
......

Note that, unlike the .bpseq format, where the extra info prevents the file from being fed directly into VARNA, the extra info in the .ct format causes no trouble for VARNA when visualizing the 2D structure.
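The trailing annotations in the --more output shown above are easy to pick apart in a script. The sketch below assumes, based only on the three lines listed, that the extras follow a "#" and are comma-separated key=value items; split_annotation is a hypothetical helper name.

```python
def split_annotation(line):
    """Split a CT line into its columns and a dict of 'key=value' extras."""
    body, _, comment = line.partition("#")
    extras = {}
    for item in comment.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:                      # skip empty or malformed items
            extras[key] = value
    return body.split(), extras


# Lines copied from the 1msy --more output listed above.
LINES = [
    "    1 U     0     2     0  2647 # name=A.U2647",
    "    2 G     1     3    26  2648 # name=A.G2648, pairedNt=A.U2672",
    "    3 C     2     4    25  2649 # name=A.C2649, pairedNt=A.G2671",
]

parsed = [split_annotation(ln) for ln in LINES]
for columns, extras in parsed:
    print(columns[0], extras.get("name"), extras.get("pairedNt"))
```

Unpaired nucleotides carry only a name entry, so pairedNt is looked up with get() rather than indexed directly.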

The --structure-title option is another small feature implemented in DSSR. It is currently not documented in the DSSR User Manual since this feature is unlikely to be of general interest.


DSSR commands used, and the output .ct files:

x3dna-dssr -i=1msy.pdb
cp dssr-2ndstrs.ct 1msy-dssr-default.ct

x3dna-dssr -i=1msy.pdb --structure-title='CT file derived from DSSR'
cp dssr-2ndstrs.ct 1msy-dssr-title.ct

x3dna-dssr -i=1msy.pdb --structure-title=""
cp dssr-2ndstrs.ct 1msy-dssr-notitle.ct

x3dna-dssr -i=1msy.pdb --more --structure-title="1msy with extra info"
cp dssr-2ndstrs.ct 1msy-dssr-extra.ct


---

The `--bpseq` option in DSSR

By default, DSSR produces RNA secondary structures in three commonly used file formats: ViennaRNA package dbn, Mfold connect table (.ct), and CRW bpseq. All three can be fed directly into visualization tools such as VARNA. In this blog post, I want to dig deeper into the bpseq format and show the variations available from DSSR.

According to "RNA STRAND v2.0 - The RNA secondary STRucture and statistical ANalysis Database" (with slight editing):


BPSEQ format: The file name should end with the suffix ".bpseq", as in "mystr.bpseq". The bpseq format is a simple text format in which there is one line per base in the molecule, listing the position of the base (leftmost position is 1), the base name (A,C,G,U, or other alphabetical characters), and the position number of the base to which it is paired, with a 0 denoting that the base is unpaired. For more information, see the Comparative RNA Web Site. An example is as follows:

1 G 8 
2 G 7 
3 C 0 
4 A 0 
5 U 0  
6 U 0 
7 C 2 
8 C 1 

For complexes with more than one molecule, the molecules are listed in sequence, with the base pairs numbers of each successive molecule following in order from the previous molecule.
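The three-column layout quoted above maps directly onto a small parser. The following sketch reads the eight-line example and converts it to dot-bracket notation; the function names are illustrative, and the conversion assumes properly nested pairs (pseudoknots would need additional bracket types).

```python
# The eight-line example quoted from RNA STRAND above.
BPSEQ_TEXT = """\
1 G 8
2 G 7
3 C 0
4 A 0
5 U 0
6 U 0
7 C 2
8 C 1
"""


def parse_bpseq(text):
    """Return a list of (position, base, partner) tuples."""
    records = []
    for ln in text.splitlines():
        if not ln.strip():
            continue
        pos, base, pair = ln.split()[:3]
        records.append((int(pos), base, int(pair)))
    return records


def to_dot_bracket(records):
    """Nested pairs only: '(' opens, ')' closes, '.' is unpaired."""
    symbols = []
    for pos, _base, pair in records:
        if pair == 0:
            symbols.append(".")
        elif pair > pos:
            symbols.append("(")
        else:
            symbols.append(")")
    return "".join(symbols)


records = parse_bpseq(BPSEQ_TEXT)
sequence = "".join(base for _pos, base, _pair in records)
structure = to_dot_bracket(records)
print(sequence)
print(structure)
```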


The bases in bpseq format are identified by position numbers starting from 1 for the leftmost position. That is the convention DSSR follows by default in its .bpseq output. For example, for the PDB entry 1msy, which contains 27 nucleotides, the command x3dna-dssr -i=1msy.pdb will generate a file named dssr-2ndstrs.bpseq with the following contents (abbreviated):

    1 U     0
    2 G    26
    3 C    25
......
   25 G     3
   26 U     2
   27 G     0

However, according to the PDB atomic coordinates, the nucleotides are numbered from U2647 (#1) to G2673 (#27), as shown in Figure 1 below:



Figure 1. 3D and 2D structures of PDB entry 1msy. (A) 3D schematic auto-created via the DSSR-PyMOL integration. The labeled residues follow PDB coordinates. (B) 2D diagram rendered with VARNA using DSSR-derived 2D structural information in the .ct format. This figure was annotated using Inkscape.


It makes sense for the base labels in the 2D bpseq format to follow those of the 3D atomic coordinates in the PDB. Instead of starting from position 1 as shown above, the bpseq file would then start with 2647. That is exactly what the DSSR --bpseq option was created for. With the command x3dna-dssr -i=1msy.pdb --bpseq, the output file dssr-2ndstrs.bpseq now has the following contents (abbreviated):

  2647 U      0
  2648 G   2672
  2649 C   2671
......
  2671 G   2649
  2672 U   2648
  2673 G      0

This .bpseq file can be read by VARNA (tested with VARNAv3-93.jar) to generate a 2D image as shown in Figure 1(B) above.
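Some downstream tools may expect the conventional 1-based positions rather than PDB residue numbers. A minimal sketch of mapping one numbering onto the other is given below. It uses only the six lines listed above, so the resulting positions 1 to 6 are relative to this excerpt, not to the full 27-nucleotide structure. The helper name renumber is made up, and the code assumes that no residue is itself labeled 0, since 0 marks an unpaired base.

```python
# Abbreviated --bpseq output for 1msy, as listed above (elided lines omitted).
PDB_NUMBERED = """\
  2647 U      0
  2648 G   2672
  2649 C   2671
  2671 G   2649
  2672 U   2648
  2673 G      0
"""


def renumber(text):
    """Replace residue labels with 1-based positions in order of appearance."""
    rows = [ln.split()[:3] for ln in text.splitlines() if ln.strip()]
    position = {label: i for i, (label, _base, _partner) in enumerate(rows, start=1)}
    out = []
    for label, base, partner in rows:
        # "0" means unpaired and is passed through unchanged.
        pair = 0 if partner == "0" else position[partner]
        out.append((position[label], base, pair))
    return out


renumbered = renumber(PDB_NUMBERED)
for pos, base, pair in renumbered:
    print(f"{pos:5d} {base} {pair:5d}")
```

A lookup table is used instead of subtracting a fixed offset so that gaps or non-consecutive residue numbers do not break the mapping.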

Moreover, with the command x3dna-dssr -i=1msy.pdb --bpseq=extra, the output file dssr-2ndstrs.bpseq now contains additional info to easily identify a nucleotide and its pairing partner:

  2647 U      0 # name=A.U2647
  2648 G   2672 # name=A.G2648, pairedNt=A.U2672
  2649 C   2671 # name=A.C2649, pairedNt=A.G2671
......
  2671 G   2649 # name=A.G2671, pairedNt=A.C2649
  2672 U   2648 # name=A.U2672, pairedNt=A.G2648
  2673 G      0 # name=A.G2673

It should be noted that this .bpseq output file is no longer compliant with the standard and cannot be fed into VARNA for visualization.
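If only the annotated file is at hand, the extra info can be stripped to get back to the plain three-column layout. The sketch below assumes the annotations always start with "#", as in the lines shown above; strip_extra is a hypothetical helper, and rerunning DSSR with plain --bpseq remains the more direct route.

```python
# Lines copied from the --bpseq=extra output listed above (elided lines omitted).
EXTRA_TEXT = """\
  2647 U      0 # name=A.U2647
  2648 G   2672 # name=A.G2648, pairedNt=A.U2672
  2649 C   2671 # name=A.C2649, pairedNt=A.G2671
  2671 G   2649 # name=A.G2671, pairedNt=A.C2649
  2672 U   2648 # name=A.U2672, pairedNt=A.G2648
  2673 G      0 # name=A.G2673
"""


def strip_extra(text):
    """Drop everything from '#' onward on each line; skip blank lines."""
    kept = []
    for ln in text.splitlines():
        core = ln.split("#", 1)[0].rstrip()
        if core:
            kept.append(core)
    return "\n".join(kept) + "\n"


cleaned = strip_extra(EXTRA_TEXT)
print(cleaned, end="")
```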

The --bpseq option was added at users' request. The --bpseq=extra variation was implemented recently to ensure that the --bpseq option by itself produces a valid .bpseq file without extra info (e.g., as enabled with the global --more option). Now the extra info for .bpseq output is enabled only by setting --bpseq=extra explicitly.

This --bpseq option and its evolution are a good example of how DSSR responds to community requests. I'm here to listen, and I'm always willing to improve DSSR to better fit users' needs. If you make use of DSSR in your pipeline and need some adaptations, please do not hesitate to contact me. I may consider adding a new option or otherwise revising the code to streamline the integration of DSSR into your project.


DSSR commands used, and the output .bpseq files:

x3dna-dssr -i=1msy.pdb
cp dssr-2ndstrs.bpseq 1msy-dssr-default.bpseq

x3dna-dssr -i=1msy.pdb --bpseq
cp dssr-2ndstrs.bpseq 1msy-dssr-bpseq.bpseq

x3dna-dssr -i=1msy.pdb --bpseq=extra
cp dssr-2ndstrs.bpseq 1msy-dssr-bpseq-extra.bpseq


---

Over 5000 registrations on the 3DNA Forum

As I am writing this blog post on June 26, 2020, the number of registrations on the 3DNA Forum has reached 5,054. The numbers were 3,000 on October 15, 2016, 2,000 on February 3, 2015, and 1,000 on February 27, 2013. For 2020, the monthly registrations are 36 (January), 35 (February), 54 (March), 84 (April), and 69 (May). As of June 26, the number for June is 56, which will more than likely pass 60 by the end of the month. The Covid-19 pandemic does not seem to be having a negative effect on the registrations.

The over 5,000 registrations are from users all over the world. The 3DNA Forum remains spam free, and all questions are promptly answered. It is functioning well; certainly better than I originally imagined.

Overall, the Forum serves as a virtual platform for me to interact effectively with the ever-increasing user community. I greatly enjoy answering questions, fixing bugs, and making 3DNA/DSSR/SNAP better tools for real-world applications.

---


Thank you for printing this article from http://x3dna.org/. Please do not forget to visit back for more 3DNA-related information. — Xiang-Jun Lu